Lists of APKs
Everything
Big (>2.7GB compressed) CSV file updated every night (before 6am Luxembourg/Paris time), containing the following fields (not in that order):
sha256, sha1, md5, apk_size: Those are what you think they are.
dex_size: The size of the classes.dex file (i.e., ignoring all other dex files)
dex_date: The date attached to the dex file inside the zip (sometimes invalid and/or manipulated).
WARNING: dex_date is mostly unusable nowadays: the vast majority of apps from Google Play have a 1980 dex_date.
pkg_name, vercode: The name of the Android package and the version code (as reported in the manifest file).
Note: pkg_name may be unique within a single market (i.e., two APKs with the same pkg_name inside Google Play may come from the same developer).
WARNING: There is one bogus APK (BC564D52C6E79E1676C19D9602B1359A33B8714A1DC5FCB8ED602209D0B70266) whose pkg_name contains a ",". Use grep -v ',snaggamea' to get rid of it.
vt_detection, vt_scan_date: The number of antivirus engines from VirusTotal (VT) that detected this APK as malware on vt_scan_date (if available).
markets: a '|' separated list of the markets where we saw this APK. Note: The absence of a market does NOT mean that an APK was not published on this market. It means we did not see it there.
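Since the fields are not listed in file order, it can help to check the actual column positions before writing awk filters. The following is a minimal sketch, assuming the first line of the CSV is a header row naming the fields (the description above does not state this). The helper name list_columns is hypothetical, and gzip -dc is used as a portable equivalent of zcat.

```shell
# Print each column name of an AndroZoo CSV next to its 1-based position.
# Assumption: the first line of the file is a header row with the field names.
list_columns() {
    # head stops after the first line, so the whole archive is not decompressed
    gzip -dc "$1" | head -n 1 | tr ',' '\n' | nl
}
```

Calling list_columns latest.csv.gz would then print one numbered line per field, and the number is the $N to use in awk.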
New: Crawl date
In addition to the CSV described above, we provide another CSV file. It has the same fields, plus an "added" field which contains the date this APK entered AndroZoo. CSV with added field. Please note that an APK could very well have existed for years before it was added to AndroZoo.
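Because the extra "added" field changes column positions, a filter that looks a column up by name works on both files without editing field numbers. This is only a sketch under the assumption that the first line of each CSV is a header row with the field names. The helper name rows_with is hypothetical, and it does a plain substring match rather than a regular expression match.

```shell
# Print the rows whose named column contains a given substring.
# Assumption: the first line of the CSV is a header row with the field names.
# Usage: rows_with FILE COLUMN SUBSTRING
rows_with() {
    gzip -dc "$1" \
      | grep -v ',snaggamea' \
      | awk -F, -v col="$2" -v needle="$3" '
          # On the header line, remember the position of the requested column
          NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
          # Keep data rows where that column contains the substring
          c && index($c, needle)'
}
```

For example, rows_with latest.csv.gz markets play.google.com would keep the Google Play rows, and the same call should work on the CSV with the added field. If the column name is not found in the header, nothing is printed.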
Examples of filtering this list:
Note: These examples are valid for the latest.csv.gz file. For the CSV with the added field, the market is in $12, not in $11.
- Select only APKs that come from Google Play Store:
zcat latest.csv.gz | grep -v ',snaggamea' | awk -F, '{if ($11 ~ /play\.google\.com/) {print} }'
- Whose size is over 10 000 000 Bytes:
| awk -F, '{if ($5 >10000000 ) {print} }'
- Detected by at least 2 AntiVirus engines:
| awk -F, '{if ($8 >=2 ) {print} }'
- To filter on dex_date, we can use the fact that the timestamp string used is sortable, i.e. date_1_str > date_2_str only when date_1 is after date_2.
example: only dex_date starting from 2018-12-01:
| awk -F, '{if ( $4 >= "2018-12" ) {print} }'
example 2: only dex_date before 2019-11-30
| awk -F, '{if ( $4 < "2019-11-30" ) {print} }'
- To get only the list of selected sha256:
| cut -d',' -f1 > list_of_selected_sha256
So the whole command would be:
zcat latest.csv.gz | grep -v ',snaggamea' | awk -F, '{if ($11 ~ /play\.google\.com/) {print} }' | awk -F, '{if ($5 >10000000 ) {print} }' | awk -F, '{if ($8 >=2 ) {print} }' | awk -F, '{if ( $4 >= "2018-12" ) {print} }' | awk -F, '{if ( $4 < "2019-11-30" ) {print} }' | cut -d',' -f1 > list_of_selected_sha256
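The individual filters listed above (Google Play, size over 10 000 000 bytes, at least 2 detections, dex_date from 2018-12-01 and before 2019-11-30) can also be combined in a single awk program, which avoids re-splitting each line five times. This is a sketch, not the recommended method. It uses the column positions given for latest.csv.gz, the function name select_sha256 is hypothetical, and gzip -dc stands in for zcat.

```shell
# Apply all the example filters in one awk pass and print the sha256 only.
# Column positions are those of latest.csv.gz as used in the examples:
#   $1 sha256, $4 dex_date, $5 apk_size, $8 vt_detection, $11 markets
select_sha256() {
    gzip -dc "$1" \
      | grep -v ',snaggamea' \
      | awk -F, '
          $11 ~ /play\.google\.com/ &&
          $5 + 0 > 10000000 &&
          $8 + 0 >= 2 &&
          $4 >= "2018-12" &&
          $4 <  "2019-11-30" { print $1 }'
}
```

It could be used as select_sha256 latest.csv.gz > list_of_selected_sha256. Adding 0 to $5 and $8 forces a numeric comparison, while the dex_date tests stay string comparisons, which is what the sortable timestamp format relies on. For the CSV with the added field, $11 would need to become $12.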