#archiveteam-bs 2016-01-25,Mon


Time Nickname Message
00:38 🔗 wyatt8740 has quit IRC (Read error: Operation timed out)
00:39 🔗 wyatt8740 has joined #archiveteam-bs
00:48 🔗 kyan has joined #archiveteam-bs
00:51 🔗 godane has quit IRC (Read error: Operation timed out)
00:53 🔗 JesseW OK, so now I have two sorted, 3-column tab-separated-value files -- of 15 and 25 gigabytes in size, respectively. I want to know which lines in them have been added, removed or changed. What's a sensible way to do this?
00:54 🔗 pikhq comm
00:56 🔗 JesseW pikhq: for a 25gigabyte file?
00:56 🔗 pikhq comm, unlike diff, works line-by-line and assumes the inputs are sorted.
00:56 🔗 JesseW hm, ok, will try it
01:01 🔗 JesseW hm -- it does work quickly, and incrementally -- but I'm having some trouble figuring out how best to interpret the results
01:02 🔗 pikhq The -1, -2, and -3 options might help. :)
01:02 🔗 JesseW I've done comm -3 to remove the matching lines
01:02 🔗 pikhq Mmkay.
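
A quick aside on comm, whose output the next few messages interpret: it assumes both inputs are sorted and emits three columns, and the -1, -2, -3 flags each suppress one column. A sketch with hypothetical filenames:

    # column 1: lines only in old.tsv   (removed)
    # column 2: lines only in new.tsv   (added)
    # column 3: lines in both           (unchanged)
    comm old.tsv new.tsv

    # -3 suppresses the common lines, leaving only removals and additions:
    comm -3 old.tsv new.tsv
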
01:02 🔗 JesseW but it's still tricky to go from something like:
01:03 🔗 JesseW 0000000000002 0000000000002_archive.torrent 2f0355c11a6bc8dceffecbf7d46e5dce
01:03 🔗 JesseW 0000000000002 0000000000002_archive.torrent e929a012ec62c2ee021dfc5a71e749c2
01:03 🔗 JesseW 0000000000002 0000000000002_meta.xml 2c1c2c37230e5390b26651ebdf2b84c6
01:03 🔗 JesseW 0000000000002 0000000000002_meta.xml cbfcd60fecf27a9c71ab794fa8ebff74
01:03 🔗 JesseW to noticing that the torrent and meta files have changed hashes
01:04 🔗 JesseW I'd like to hack up a display of the above that made that more explicit...
01:04 🔗 pikhq Hmm.
01:29 🔗 wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES)
01:37 🔗 acridAxid has joined #archiveteam-bs
01:39 🔗 JesseW I think what I want for output is something like:
01:39 🔗 JesseW 0000000000002 0000000000002_meta.xml CHANGED
01:39 🔗 JesseW 0000000000002 thing.txt ADDED
01:39 🔗 JesseW maybe I can do that with sed...
01:49 🔗 toad1 has joined #archiveteam-bs
01:50 🔗 toad2 has quit IRC (Read error: Operation timed out)
01:57 🔗 VADemon has quit IRC (Quit: left4dead)
02:15 🔗 dashcloud JesseW: you're documenting this process (if only for yourself when you try to do it again) I hope?
02:15 🔗 acridAxid has quit IRC (Read error: Operation timed out)
02:19 🔗 username1 has joined #archiveteam-bs
02:22 🔗 schbirid2 has quit IRC (Read error: Operation timed out)
02:29 🔗 JesseW dashcloud: more or less, yeah
02:31 🔗 brayden has joined #archiveteam-bs
02:42 🔗 acridAxid has joined #archiveteam-bs
03:42 🔗 JesseW Yay, I've hacked up a sed script that gives the nicer display I wanted!
03:42 🔗 JesseW comm --output-delimiter='!!!' -3 public-file-hashes_20150304205357_sorted.tsv public-file-hashes_20150304205357_recheck_20160120112813.tsv | head -n 100 | sed -ne $'/^!!!$/d\nN\ns/^!*\([^\\t]*\)\\t\([^\\t]*\)\\t\([0-9a-f]*\)\\n!*\\1\\t\\2\\t\([0-9a-f]*\)$/@ CHANGED\\t\\1\\t\\2 FROM \\4 TO \\3/\ns/^\([^!@][^\\t]*\\t[^\\t]*\).*$/@ REMOVED\\t\\1/\ns/^!!!\([^\\t]*\\t[^\\t]*\).*$/@ ADDED \\t\\1/\np\nD'
03:43 🔗 JesseW enjoy the line noise...
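
The same pipeline, wrapped and annotated for readability (behavior unchanged; GNU sed assumed, since \t and \n in patterns are GNU extensions):

    comm --output-delimiter='!!!' -3 \
        public-file-hashes_20150304205357_sorted.tsv \
        public-file-hashes_20150304205357_recheck_20160120112813.tsv |
    head -n 100 |
    sed -ne '
      /^!!!$/d   # drop bare-delimiter lines
      N          # pair each line with the one after it
      # same identifier+filename with two different md5s -> CHANGED
      s/^!*\([^\t]*\)\t\([^\t]*\)\t\([0-9a-f]*\)\n!*\1\t\2\t\([0-9a-f]*\)$/@ CHANGED\t\1\t\2 FROM \4 TO \3/
      # an unprefixed leftover line came from the first file -> REMOVED
      s/^\([^!@][^\t]*\t[^\t]*\).*$/@ REMOVED\t\1/
      # a !!!-prefixed leftover line came from the second file -> ADDED
      s/^!!!\([^\t]*\t[^\t]*\).*$/@ ADDED \t\1/
      p          # print the (possibly rewritten) first line
      D          # discard it and reprocess the remainder
    '
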
03:46 🔗 yipdw at some point python, perl, or ruby do begin to have more relevance
03:52 🔗 JesseW sure. But sed is more fun (in some sense) :-)
03:53 🔗 JesseW Once we get a regular schedule of censuses going, I'll probably write a python program to do it.
04:01 🔗 JesseW hm -- the previous census left out the wayback data (admittedly, there was a note to that effect). Interestingly, although it isn't downloadable, the hashes for the files *are* available -- so my census grabbed them all, and is now reporting ALLOFTHEM as new files. :-)
04:04 🔗 JesseW I think it's about 10 GIGABYTES of *METADATA*. :-)
04:05 🔗 JesseW hashes of wayback files, I mean.
04:07 🔗 phuzion daaaaaaaang
04:20 🔗 acridAxid has quit IRC (Read error: Connection reset by peer)
04:29 🔗 JesseW but it's really good that they make those hashes available, because that way, we can distribute them, and when someone comes to IA saying "secretly change this thing on the Wayback Machine or else" -- IA can point to the external distribution of the (reported) hashes and say, "sure, but it'll get discovered in a couple months, and then where will you be?"
04:30 🔗 JesseW (admittedly, they could still falsify the reporting of the hashes -- but that would require manual changes to the code, and the equivalent of an accountant keeping 2 sets of books -- which in itself would be a lot more obvious to a whole lot more people *inside* IA)
04:34 🔗 JesseW wow, there are over 600,000 identifiers on IA that start with a digit
04:38 🔗 JesseW and there's about 1.6 GB of hash metadata that's only in my newer result.
04:38 🔗 phuzion <phuzion> daaaaaaaang
04:40 🔗 MrRadar JesseW: Does the IA provide secure hashes or just MD5? While that's a laudable goal, MD5 is so broken these days that it provides almost no cryptographic security
04:41 🔗 JesseW They report md5, sha1 and crc32, generally.
04:42 🔗 MrRadar That's slightly better
04:42 🔗 JesseW Unfortunately, the initial census only stored the md5, which is why I decided to follow that in this one.
04:42 🔗 MrRadar Though SHA-1 is on death's doorstep
04:43 🔗 JesseW And, given that much of this stuff is human-readable, I'm not so sure the breakage of md5 is as significant.
04:43 🔗 MrRadar Yeah, finding human-readable junk to add is probably harder than finding any junk to fudge the checksum
04:43 🔗 MrRadar But I would still like to see non-fudgeable checksums to begin with
04:43 🔗 MrRadar But thanks for doing this work
04:44 🔗 JesseW Or has it gotten broken enough that say, one can take an image of Stalin & Trotsky, remove Trotsky, and end up with the same md5 hash?
04:44 🔗 espes___ fuck no
04:44 🔗 MrRadar To that point, yes: http://natmchugh.blogspot.com/2014/10/how-i-created-two-images-with-same-md5.html
04:44 🔗 JesseW MrRadar: I'm just taking it to the next step -- Jake at IA did the original work.
04:45 🔗 JesseW heh, wow
04:45 🔗 espes___ you can generate colliding data, but that's far from forcing a hash
04:45 🔗 espes___ that's just a trick and only requires one colliding block
04:46 🔗 JesseW yeah, I saw that it sneaks the necessary junk in what is effectively a comment
04:48 🔗 JesseW but most formats do *have* comments...
04:49 🔗 espes___ again, finding a collision != finding a preimage
04:50 🔗 JesseW yep
04:55 🔗 JesseW here's a perfectly innocuous change: https://catalogd.archive.org/log/423306379 -- which was detected by my check
04:56 🔗 JesseW @ CHANGED 02196788.1207.emory.edu 02196788.1207.emory.edu_meta.xml FROM b3db8dd19bc7230af632e5ac02a5e41c TO 20604ea11a6ec1266c5c4a7fa8d3d500
05:00 🔗 yipdw JesseW: you'll find at least one md5 collision in ArchiveBot data
05:01 🔗 yipdw in particular I'm pretty sure we have a capture of http://natmchugh.blogspot.com/2014/10/images-with-colliding-md5-hash.html
05:02 🔗 yipdw ah yes http://archive.fart.website/archivebot/viewer/job/eeumo
05:02 🔗 yipdw oh wait MrRadar already posted that, go me for not reading
05:02 🔗 * yipdw is digital native
05:04 🔗 xmc curious how many bytes of duplicate files exist in IA
05:04 🔗 xmc the question that everybody new here asks :P
05:04 🔗 yipdw 1, for an appropriately sized byte
05:05 🔗 xmc for greater efficiency in transmission, we will deliver first all of the 0 bits and then all of the 1 bits.
05:05 🔗 JesseW xmc: already answered (imperfectly) on the census page: http://archiveteam.org/index.php?title=Internet_Archive_Census
05:05 🔗 xmc o
05:06 🔗 xmc 1PB or so
05:06 🔗 xmc sounds about right
05:06 🔗 JesseW There are 22,596,286 files which are copies of other files. The duplicate files take up 1.06PB of space. (Assuming all files with the same MD5 are duplicates.)
05:06 🔗 yipdw not bad
05:06 🔗 JesseW well, there's a bunch more duplication -- that's just what was counted, there
05:07 🔗 xmc aye
05:07 🔗 JesseW AFAIK, everything is duplicated once, for backup
05:07 🔗 xmc yep
05:07 🔗 JesseW and some is duplicated in Egypt
05:07 🔗 xmc one copy on each of two different nodes
05:08 🔗 JesseW and lots of stuff is duplicated but broken up into files differently
05:09 🔗 JesseW e.g. I know archivebot has hundreds of thousands of copies of some standard google javascript libraries -- none of which is counted there
05:09 🔗 JesseW because it's all inside of WARCs that aren't identical in full
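
The headline duplicate count above can be reproduced straight from a census TSV; a sketch of the counting, assuming the 3-column identifier/filename/md5 format and a hypothetical filename (byte totals would additionally need a size column):

    # files that are copies of other files: for each md5 seen N times, N-1 copies
    cut -f3 census.tsv | sort | uniq -c |
    awk '$1 > 1 { dups += $1 - 1 } END { print dups, "duplicate files" }'
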
05:09 🔗 yipdw don't petabyte if its end is bristling
05:10 🔗 JesseW if a byte becomes infected, it may be necessary to kilabyte
05:10 🔗 acridAxid has joined #archiveteam-bs
05:12 🔗 yipdw I think it would be fun to do a comparison on the zillions of copies of jQuery
05:12 🔗 yipdw group by version and see how many are actually identical, for increasingly sophisticated definitions of "identical"
05:12 🔗 yipdw someone who wants to get into static analysis of Javascript might have some fun there
05:13 🔗 JesseW lol
05:13 🔗 yipdw of course the group-by part might be difficult
05:13 🔗 JesseW yes, that could be rather entertaining
05:14 🔗 yipdw like I think it'd be awesome if someone found a rootkit in jQuery that way
05:14 🔗 * JesseW is glad my comparison script has finally made it to the "A"s...
05:17 🔗 yipdw hey that's one way to address The Website Obesity Crisis
05:18 🔗 yipdw er wait no it isn't nm
05:19 🔗 JesseW suggestions for a shellscript to print one line per gigabyte in a file, efficiently?
05:20 🔗 JesseW head -c is slow
05:20 🔗 yipdw do you want to get the first line at each gigabyte boundary?
05:20 🔗 yipdw or output something per GB processed
05:20 🔗 JesseW the first
05:21 🔗 JesseW I want to print the line just after (or before, doesn't matter) each gigabyte boundry
05:21 🔗 JesseW (er, boundary)
05:21 🔗 JesseW it's basically just seek -- I just don't remember an efficient way to do it from the commandline
05:22 🔗 yipdw I don't either
05:22 🔗 yipdw you might have better luck in python etc
05:22 🔗 yipdw I know seek()/fseek exist there
05:22 🔗 JesseW hm... /me pokes at it
05:23 🔗 yipdw are you guaranteed that each line will start on a gigabyte boundary, or is some seeking to find end-of-line required
05:24 🔗 JesseW not guaranteed
05:25 🔗 JesseW but that's just read two lines, and discard the first
05:25 🔗 JesseW as I said, I don't care about exactness, just "what's the general layout around here"
05:26 🔗 JesseW ok, got it in python
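
The Python version seeks to each boundary and discards the first, probably partial, line there (per the messages above). The same sampling is possible from the shell after all, since dd can skip without reading; a sketch assuming GNU coreutils and a hypothetical filename:

    f=census.tsv
    size=$(stat -c%s "$f")
    for ((off = 0; off < size; off += 1 << 30)); do
        # read one 4 KB block at each 1 GB boundary; the first line in it
        # is probably partial, so print the second
        dd if="$f" bs=4096 skip=$((off / 4096)) count=1 2>/dev/null | sed -n 2p
    done
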
05:28 🔗 JesseW so the first gigabyte of the new hashes is all identifiers that start with a digit; it takes until 5 GB to get to the B's
05:29 🔗 JesseW 3 gigabytes of RECAP
05:30 🔗 JesseW oddly, 3 gigabytes of playdrone-metadata ...?
05:31 🔗 JesseW http://systems.cs.columbia.edu/projects/playdrone/
05:41 🔗 JesseW and there's this one, from the million books project, that looks like it was an early effort, in a weird format: http://archive.org/details/TheMetallurgyOfLead -- it has 3097 files in it
05:59 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
06:11 🔗 godane has joined #archiveteam-bs
06:15 🔗 mutoso has joined #archiveteam-bs
06:17 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
06:22 🔗 mutoso has joined #archiveteam-bs
06:27 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
06:28 🔗 xmc JesseW: here's some numbers on deduping real-world warcs: https://www.taricorp.net/2016/web-history-warc
06:32 🔗 mutoso has joined #archiveteam-bs
06:34 🔗 JesseW neat
06:34 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
06:34 🔗 JesseW Here's an item using a *third* form of "semi-private" https://archive.org/metadata/decom.accumulator03_archive_org.2.texts.TareeqDarbarEDelhi
06:35 🔗 JesseW we're up to: is_dark: true, nodownload: true, private:true (on individual files) and delete.php
06:36 🔗 JesseW plus whatever (presumably renaming) happened to the ~500 identifiers I can't find any history on
06:36 🔗 JesseW I do like the "yet" in the error messages displayed for "nodownload: true"
06:39 🔗 mutoso has joined #archiveteam-bs
06:42 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
06:46 🔗 yipdw a while back I was wondering how I'd use all the extra memory prgmr gave me and now I have the answer
06:47 🔗 yipdw "docker run"
06:47 🔗 yipdw not that docker run is a hog in itself but it does make it pretty easy to launch complicated things
06:47 🔗 mutoso has joined #archiveteam-bs
06:51 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
06:59 🔗 mutoso has joined #archiveteam-bs
07:03 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
07:13 🔗 mutoso has joined #archiveteam-bs
07:26 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
07:26 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
07:29 🔗 BlueMaxim has joined #archiveteam-bs
07:29 🔗 BlueMaxim has quit IRC (Connection closed)
07:31 🔗 mutoso has joined #archiveteam-bs
08:13 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
08:13 🔗 mutoso has joined #archiveteam-bs
08:13 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
08:24 🔗 mutoso has joined #archiveteam-bs
08:37 🔗 godane so i found a usenet site with nzb and nfo files
08:38 🔗 godane i can mirror it by date too
08:39 🔗 JesseW what are nzb and nfo files?
08:40 🔗 godane nzb are the usenet files to download stuff
08:40 🔗 godane nfo are the pirate notes
08:41 🔗 JesseW hm, neat
08:42 🔗 JesseW my census diff has made it to 's'
08:42 🔗 JesseW having identified 5G of differences (the vast majority being private files excluded from the original one)
08:46 🔗 JesseW Still, it has found over 58,000 changed md5 hashes...
08:48 🔗 JesseW of which the vast majority are changes to metadata of various sorts
08:49 🔗 wp494 has joined #archiveteam-bs
08:53 🔗 JesseW found a few errors, like http://archive.org/metadata/2003-11-30.paf.sbd.wizard.23733.sbeok.flacf which somehow got its _files.xml 's format set to "Windows Media"...
08:56 🔗 JesseW has quit IRC (Leaving.)
10:11 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
10:17 🔗 mutoso has joined #archiveteam-bs
10:57 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
11:02 🔗 mutoso has joined #archiveteam-bs
11:12 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
11:13 🔗 vtyl has joined #archiveteam-bs
11:17 🔗 lytv has quit IRC (Read error: Operation timed out)
11:22 🔗 mutoso has joined #archiveteam-bs
11:29 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
11:34 🔗 mutoso has joined #archiveteam-bs
11:41 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
11:47 🔗 mutoso has joined #archiveteam-bs
12:05 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
12:13 🔗 mutoso has joined #archiveteam-bs
12:47 🔗 vitzli has joined #archiveteam-bs
13:16 🔗 VADemon has joined #archiveteam-bs
13:20 🔗 signius has quit IRC (Remote host closed the connection)
13:24 🔗 signius has joined #archiveteam-bs
14:18 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
14:29 🔗 mutoso has joined #archiveteam-bs
14:31 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
14:36 🔗 mutoso has joined #archiveteam-bs
14:36 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
14:42 🔗 mutoso has joined #archiveteam-bs
14:42 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
14:53 🔗 mutoso has joined #archiveteam-bs
14:54 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
15:17 🔗 mutoso has joined #archiveteam-bs
15:35 🔗 JetBalsa has joined #archiveteam-bs
15:41 🔗 godane i'm at 623k items now
16:56 🔗 vitzli has quit IRC (Leaving)
17:15 🔗 chazchaz has quit IRC (Read error: Operation timed out)
17:20 🔗 chazchaz has joined #archiveteam-bs
17:37 🔗 JesseW has joined #archiveteam-bs
17:40 🔗 JesseW has quit IRC (Client Quit)
18:34 🔗 VADemon has quit IRC (Read error: Operation timed out)
18:36 🔗 username1 this is a nice licensing/download page http://jvectormap.com/licenses-and-pricing/
18:36 🔗 username1 is now known as schbirid
18:54 🔗 joepie91 schbirid: hrm.
18:54 🔗 joepie91 schbirid: it wrongly implies that the GPL is not for commercial use
19:01 🔗 schbirid nah, i think it makes it convenient for people to pay if they want to use it commercially ;)
19:21 🔗 schbirid fuck you wget, why can't you handle memory better
19:21 🔗 schbirid i should just always use wpull...
19:25 🔗 schbirid thanks debian https://pastee.org/u66db
19:42 🔗 signius has quit IRC (Read error: Operation timed out)
19:44 🔗 signius has joined #archiveteam-bs
19:44 🔗 joepie91 lol
19:50 🔗 Smiley it relies on dumb managers
19:50 🔗 Smiley can't blame them tbh
19:50 🔗 midas joepie91: guess who's back
19:50 🔗 midas http://bash.org/
19:51 🔗 Smiley omg grab
19:56 🔗 midas archivebot is already on it
19:58 🔗 MrRadar And was on it two months ago: http://archive.fart.website/archivebot/viewer/job/8hlot
20:09 🔗 yipdw ugh
20:09 🔗 yipdw "Run web traffic over HTTPS", they said
20:09 🔗 yipdw Amazon Elastic Beanstalk does not support HTTPS-based status pings
20:09 🔗 yipdw sigh
20:14 🔗 yipdw has quit IRC (Read error: Operation timed out)
20:21 🔗 yipdw has joined #archiveteam-bs
20:22 🔗 joepie91 lol
20:31 🔗 SimpBrain current rough friendsreunited group count, workplaces 268k, schools 114k, armed forces 10k, towns 25k
20:35 🔗 SimpBrain teams 182k, friend groups 24k
20:46 🔗 SimpBrain this is without any profile count
20:46 🔗 SimpBrain schools will be a bigger grab since some schools have nearly 1k users attached
20:46 🔗 SimpBrain and about the same in photos
21:19 🔗 dashcloud has quit IRC (Read error: Operation timed out)
21:21 🔗 schbirid has quit IRC (Quit: Leaving)
21:23 🔗 dashcloud has joined #archiveteam-bs
22:12 🔗 slyphic is now known as slyphic|a
22:12 🔗 ersi yipdw: it'll be fun they said
22:40 🔗 joepie91 very dodgy, ransomware link at the top of bash.org: http://bash.org/?latest
22:40 🔗 joepie91 cc midas Smiley MrRadar
22:53 🔗 MrRadar Yeah, I saw that
22:53 🔗 MrRadar It's really suspicious
22:54 🔗 MrRadar It's the newest entry on the site
22:54 🔗 MrRadar But it is also the highest-rated
22:54 🔗 MrRadar Definitely reeks of spam
23:12 🔗 RichardG has quit IRC (Ping timeout: 250 seconds)
23:13 🔗 Smiley ??
23:13 🔗 Smiley ah nice
23:13 🔗 Smiley so possibly not really back lol
23:14 🔗 RichardG has joined #archiveteam-bs
23:15 🔗 Smiley ooo yus
