This post is a summary of the study on “Data flow analysis tools for GDPR accountability compliance”, conducted together with the Spanish Data Protection Agency and available at https://www.aepd.es/sites/default/files/2019-09/estudio-flujos-informacion-android.pdf.
This is the first in a series of posts on examining the behaviour of Android applications (“apps”) from a privacy and data protection perspective. The first part of this post briefly introduces the concepts of personal data sources and sinks. Since the unauthorised disclosure of personal data is one of the most prominent privacy threats, the second part lists a small sample of tools that you can use to start analysing personal data flows in Android apps.
First of all, what is an Android application or, in short, an “app”? An app is simply consumer-oriented software that either provides utilities (e.g. a calculator or a torch) or provides an interface to a service (e.g. to access your health history or your bank). The open model of Android allows any individual or organisation to (1) develop an app using Java or Kotlin, (2) package all the files it needs in a format known as APK (Android PacKage), and (3) distribute it worldwide through different platforms, including the official Google Play Store and third-party distributors such as F-Droid, APK Pure or Uptodown. Any user can then download these apps to their mobile device, most of them for free (a little over 96% of Play Store apps are free).
The second question is: what are the sources and sinks of personal data in an app? Users can download and install a variety of apps, ranging from entertainment and sport to education, banking and health. Because we use these apps in quite personal and private settings, we end up sharing data with them that we would not share even with the person we trust the most :). It can therefore be expected that apps process information that is personal in nature and that comes from different sources, for example:
- Users: Users provide personal information directly to apps through forms, e.g. registration forms that request personal data such as name, address and phone number.
- Sensors: Mobile devices incorporate or can interoperate with sensors of different kinds (e.g. GPS, camera, microphone, Wi-Fi, health and fitness sensors) that generate a considerable amount of personal data (e.g. location, photographs, personal audio notes, temperature, heart rate) that apps can access.
- Other apps: Personal data belonging to other apps on the device is another source of information that an app can access through inter-process communication mechanisms (e.g. intents).
- Environment: The hardware and operating system internally handle global identifiers, such as the IP address, IMEI and MAC address, as well as metadata and application usage logs (e.g. frequency and time of usage) that apps could access and then use to track and profile users’ behaviour.
The Android OS exposes sensor data and environment identifiers to apps through an API (Application Programming Interface). For example, the code snippet below includes the API methods that an app uses to access the IMEI (getDeviceId) and the location (getLastKnownLocation).
    // Source: read the device identifier (IMEI)
    TelephonyManager tm = (TelephonyManager) getSystemService(Context.TELEPHONY_SERVICE);
    String imei = tm.getDeviceId();

    // Source: read the last known GPS location
    final LocationManager lm = (LocationManager) getSystemService(Context.LOCATION_SERVICE);
    Location last = lm.getLastKnownLocation(LocationManager.GPS_PROVIDER);
While there are API methods for accessing personal data sources, there are also API methods for sending that data to destinations outside the app (not necessarily outside the mobile device), e.g. via the file system, the network interface, SMS, NFC (Near Field Communication) or Bluetooth.
    // Sink: the IMEI leaves the app as a URL query parameter
    URL url = new URL("https://etsit.upm.es/?src=" + imei);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    InputStream in = conn.getInputStream();
So far we can summarise two important concepts:
- Source-related API methods give access to external resources (external to the app, not necessarily external to the mobile device) from which data is read, e.g. the device ID, current location, contacts and photos.
- Sink-related API methods give access to external resources to which data is written, e.g. the Internet, SMS, the file system, NFC and Bluetooth.
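To make the link between the two concepts concrete, here is a minimal sketch in plain Java (it does not use the Android SDK). Everything in it is an assumption made for illustration: the `ToyTaintTracker` class, the `Stmt` representation of a statement and the tiny source and sink lists are hypothetical, and the model only handles straight-line code. The idea is simply that a value read from a source method is marked as tainted, the mark spreads to values derived from it, and a leak is reported when a tainted value reaches a sink method.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model, not the internals of any real analysis tool.
class ToyTaintTracker {
    // Illustrative lists only; they reuse API methods mentioned in this post.
    static final Set<String> SOURCES = Set.of("getDeviceId", "getLastKnownLocation");
    static final Set<String> SINKS = Set.of("openConnection", "sendTextMessage");

    /** One simplified statement: target = method(inputs). target may be null. */
    static final class Stmt {
        final String target;
        final String method;
        final List<String> inputs; // receiver and arguments, as variable names

        Stmt(String target, String method, List<String> inputs) {
            this.target = target;
            this.method = method;
            this.inputs = inputs;
        }
    }

    /** Returns the sink methods that receive data derived from a source. */
    static List<String> findLeaks(List<Stmt> program) {
        Set<String> tainted = new HashSet<>();
        List<String> leaks = new ArrayList<>();
        for (Stmt s : program) {
            boolean readsTainted = s.inputs.stream().anyMatch(tainted::contains);
            if (readsTainted && SINKS.contains(s.method)) {
                leaks.add(s.method);
            }
            if (s.target != null) {
                if (SOURCES.contains(s.method) || readsTainted) {
                    tainted.add(s.target);
                } else {
                    tainted.remove(s.target); // overwritten with clean data
                }
            }
        }
        return leaks;
    }

    public static void main(String[] args) {
        // A hand-written model of the IMEI and URL snippets shown earlier.
        List<Stmt> program = List.of(
                new Stmt("imei", "getDeviceId", List.of("tm")),
                new Stmt("url", "newURL", List.of("prefix", "imei")),
                new Stmt("conn", "openConnection", List.of("url")),
                new Stmt("in", "getInputStream", List.of("conn")));
        System.out.println(findLeaks(program));
    }
}
```

Run on this hand-written model of the snippets above, it prints `[openConnection]`: the IMEI flows into the URL, and the URL reaches the network sink.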
Let’s move on! The third question is: why are sources and sinks relevant for assessing privacy and data protection issues? Broadly speaking, apps may disclose personal data from sources that the user has not authorised (or that the law does not permit) to sinks that the user has not authorised (or that the law does not permit). This can happen because some app operations are opaque even to the developers themselves, particularly when third-party libraries are integrated during development. We will explain this further in another post; for now, it is enough to know that apps can integrate multiple third-party libraries, which provide features such as analytics, social network integration and app monetisation through ads. Finally, be aware that data can be disclosed across the world, even to places where your local legislation does not protect you!
Well, the final question is: what techniques and tools can be used to detect personal data flows between sensitive sources and sinks? There are many, and we can classify them into three categories: static analysis, dynamic analysis and hybrid analysis. In a nutshell, all three seek to find connections between sources and sinks. Static analysis takes the app’s source or intermediate code as input, examines it without running it, and estimates which sinks are reachable from which sources. Dynamic analysis, on the other hand, relies on the app’s behaviour during real execution: it generates a finite set of events that stimulate the app, captures and stores the resulting logs, and finally analyses them to find which sinks the sources actually reach. Traffic analysis is a particular type of dynamic analysis that focuses on the communications carried out by the app, analysing both the metadata (e.g. recipients) and the data transmitted. Finally, hybrid analysis combines static and dynamic analysis. The table below shows a very small sample of the tools available for each task; details can be found in the complete report.
| Purpose | Tools |
|---|---|
| Tools for identifying sources and sinks | PScout, Axplorer, SuSi |
| Tools for flow analysis | Androguard, Soot, FlowDroid, Epicc, IccTA |
| Tools for event generation | Exerciser Monkey, Monkey Runner, UIHarvester (Reaper) |
| Tools for dynamic instrumentation | Frida, Xposed Framework, Cydia Substrate |
| Tools for traffic interception | Lumen, MITMProxy |
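As a rough illustration of the traffic analysis idea, the sketch below searches a captured payload for a personal value that we already know (here a made-up IMEI), both in plain form and in a few encodings. It is plain Java and does not depend on any of the tools in the table. The `TrafficLeakScanner` class and its method names are hypothetical, and the choice of encodings (URL encoding, Base64, MD5 and SHA-256 hex digests) is only an assumption about what an app might apply before sending the data.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any tool listed in the table.
class TrafficLeakScanner {

    /** Lowercase hex digest of a value; algorithm is "MD5" or "SHA-256". */
    static String hexDigest(String algorithm, String value) {
        try {
            byte[] digest = MessageDigest.getInstance(algorithm)
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Representations of the known value to look for, keyed by a label. */
    static Map<String, String> encodings(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        Map<String, String> candidates = new LinkedHashMap<>();
        candidates.put("plain", value);
        candidates.put("url-encoded", URLEncoder.encode(value, StandardCharsets.UTF_8));
        candidates.put("base64", Base64.getEncoder().encodeToString(bytes));
        candidates.put("md5", hexDigest("MD5", value));
        candidates.put("sha-256", hexDigest("SHA-256", value));
        return candidates;
    }

    /** Returns the labels of every encoding of knownValue found in payload. */
    static List<String> scan(String payload, String knownValue) {
        List<String> hits = new ArrayList<>();
        List<String> seen = new ArrayList<>();
        for (Map.Entry<String, String> e : encodings(knownValue).entrySet()) {
            // Skip encodings identical to one already checked
            // (e.g. URL encoding leaves a digits-only value unchanged).
            if (seen.contains(e.getValue())) {
                continue;
            }
            seen.add(e.getValue());
            if (payload.contains(e.getValue())) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String imei = "490154203237518"; // made-up example value
        String encoded = Base64.getEncoder()
                .encodeToString(imei.getBytes(StandardCharsets.UTF_8));
        System.out.println(scan("GET /?src=" + imei, imei));
        System.out.println(scan("POST /track id=" + encoded, imei));
        System.out.println(scan("GET /index.html", imei));
    }
}
```

The three calls in `main` print `[plain]`, `[base64]` and `[]`. The matching is deliberately naive: it only finds exact, case-sensitive matches, so an uppercase hex digest, a salted hash or an encrypted payload would go unnoticed.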
About Danny Guamán
Ph.D. student at Technical University of Madrid (Spain). Auxiliary professor at Escuela Politécnica Nacional (Ecuador). Current interests: Privacy in Cloud Computing and IoT.