[00:38] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[00:39] *** wyatt8740 has joined #archiveteam-bs
[00:48] *** kyan has joined #archiveteam-bs
[00:51] *** godane has quit IRC (Read error: Operation timed out)
[00:53] <JesseW> OK, so now I have two sorted, 3-column tab-separate-value files -- of 15 and 25 gigabytes in size, respectively. I want to know which lines in them have been added, removed or changed. What's a sensible way to do this?
[00:54] <pikhq> comm
[00:56] <JesseW> pikhq: for a 25gigabyte file?
[00:56] <pikhq> comm, unlike diff, works line-by-line and assumes the inputs are sorted.
[00:56] <JesseW> hm, ok, will try it
[01:01] <JesseW> hm -- it does work quickly, and incrementally -- but I'm having some trouble figuring out how best to interpret the results 
[01:02] <pikhq> The -1, -2, and -3 options might help. :)
[01:02] <JesseW> I've done comm -3 to remove the matching lines
[01:02] <pikhq> Mmkay.
[01:02] <JesseW> but it's still tricky to go from something like:
[01:03] <JesseW>         0000000000002   0000000000002_archive.torrent   2f0355c11a6bc8dceffecbf7d46e5dce
[01:03] <JesseW> 0000000000002   0000000000002_archive.torrent   e929a012ec62c2ee021dfc5a71e749c2
[01:03] <JesseW> 0000000000002   0000000000002_meta.xml  2c1c2c37230e5390b26651ebdf2b84c6
[01:03] <JesseW>         0000000000002   0000000000002_meta.xml  cbfcd60fecf27a9c71ab794fa8ebff74
[01:03] <JesseW> to noticing that the torrent and meta files have changed hashes
[01:04] <JesseW> I'd like to hack up a display of the above that made that more explicit...
[01:04] <pikhq> Hmm.
[01:29] *** wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES)
[01:37] *** acridAxid has joined #archiveteam-bs
[01:39] <JesseW> I think what I want for output is something like:
[01:39] <JesseW> 0000000000002   0000000000002_meta.xml  CHANGED
[01:39] <JesseW> 0000000000002   thing.txt ADDED
[01:39] <JesseW> maybe I can do that with sed...
[01:49] *** toad1 has joined #archiveteam-bs
[01:50] *** toad2 has quit IRC (Read error: Operation timed out)
[01:57] *** VADemon has quit IRC (Quit: left4dead)
[02:15] <dashcloud> JesseW: you're documenting this process (if only for yourself when you try to do it again) I hope?
[02:15] *** acridAxid has quit IRC (Read error: Operation timed out)
[02:19] *** username1 has joined #archiveteam-bs
[02:22] *** schbirid2 has quit IRC (Read error: Operation timed out)
[02:29] <JesseW> dashcloud: more or less, yeah
[02:31] *** brayden has joined #archiveteam-bs
[02:42] *** acridAxid has joined #archiveteam-bs
[03:42] <JesseW> Yay, I've hacked up a sed script that gives the nicer display I wanted!
[03:42] <JesseW> comm --output-delimiter='!!!' -3 public-file-hashes_20150304205357_sorted.tsv public-file-hashes_20150304205357_recheck_20160120112813.tsv | head -n 100 | sed -ne $'/^!!!$/d\nN\ns/^!*\([^\\t]*\)\\t\([^\\t]*\)\\t\([0-9a-f]*\)\\n!*\\1\\t\\2\\t\([0-9a-f]*\)$/@ CHANGED\\t\\1\\t\\2 FROM \\4 TO \\3/\ns/^\([^!@][^\\t]*\\t[^\\t]*\).*$/@ REMOVED\\t\\1/\ns/^!!!\([^\\t]*\\t[^\\t]*\).*$/@ ADDED  \\t\\1/\np\nD'
[03:43] <JesseW> enjoy the line noise...
[03:46] <yipdw> at some point python perl or ruby do begin to have more relevance
[03:52] <JesseW> sure. But sed is more fun (in some sense) :-)
[03:53] <JesseW> Once we get a regular schedule of census's going on, I'll probably write a python program to do it.
[04:01] <JesseW> hm -- the previous census left out the wayback data (admittedly, there was a note to that effect). Interestingly, although it isn't downloadable, the hashes for the files *are* available -- so my census grabbed them all, and is now reporting ALLOFTHEM as new files. :-)
[04:04] <JesseW> I think it's about 10 GIGABYTES of *METADATA*. :-)
[04:05] <JesseW> hashes of wayback files, I mean.
[04:07] <phuzion> daaaaaaaang
[04:20] *** acridAxid has quit IRC (Read error: Connection reset by peer)
[04:29] <JesseW> but it's really good that they make those hashes available, because that way, we can distribute them, and when someone comes to IA saying "secretly change this thing on the Wayback Machine or else" -- IA can point to the external distribution of the (reported) hashes and say, "sure, but it'll get discovered in a couple months, and then where will you be?"
[04:30] <JesseW> (admittedly, they could still falsify the reporting of the hashes -- but that would require manual changes to the code, and the equivalent of an accountant keeping 2 sets of books -- which in itself would be a lot more obvious to a whole lot more people *inside* IA)
[04:34] <JesseW> wow, there are over 600,000 identifiers on IA that start with a digit
[04:38] <JesseW> and there's about 1.6 GB of hash metadata only in my newer result in there.
[04:38] <phuzion> <phuzion> daaaaaaaang
[04:40] <MrRadar> JesseW: Does the IA provide secure hashes or just MD5? While that's a laudable goal MD5 is so broken these days that it provides almost no cryptographic security
[04:41] <JesseW> They report md5, sha1 and crc32, generally.
[04:42] <MrRadar> That's slightly better
[04:42] <JesseW> Unfortunately, the initial census only stored the md5, which is why I decided to follow that in this one.
[04:42] <MrRadar> Though SHA-1 is on death's doorstep
[04:43] <JesseW> And, given that much of this stuff is human-readable, I'm not so sure the breakage of md5 is as significant.
[04:43] <MrRadar> Yeah, finding human-readable junk to add is probably harder than finding any junk to fudge the checksum
[04:43] <MrRadar> But I would still like to see non-fudgeable checksums to begin with
[04:43] <MrRadar> But thanks for doing this work
[04:44] <JesseW> Or has it gotten broken enough that say, one can take an image of Stalin & Trotsky, remove Trotsky, and end up with the same md5 hash?
[04:44] <espes___> fuck no
[04:44] <MrRadar> To that point, yes: http://natmchugh.blogspot.com/2014/10/how-i-created-two-images-with-same-md5.html
[04:44] <JesseW> MrRadar: I'm just taking it to the next step -- Jake at IA did the original work.
[04:45] <JesseW> heh, wow
[04:45] <espes___> you can generate colliding data, but that's far from forcing a hash
[04:45] <espes___> that's just a trick and only requries one colliding block
[04:46] <JesseW> yeah, I saw that it sneaks the necessary junk in what is effectively a comment
[04:48] <JesseW> but most formats do *have* comments...
[04:49] <espes___> again, finding a collision != finding a preimage
[04:50] <JesseW> yep
[04:55] <JesseW> here's a perfectly innocuous change: https://catalogd.archive.org/log/423306379 -- which was detected by my check
[04:56] <JesseW> @ CHANGED       02196788.1207.emory.edu 02196788.1207.emory.edu_meta.xml FROM b3db8dd19bc7230af632e5ac02a5e41c TO 20604ea11a6ec1266c5c4a7fa8d3d500
[05:00] <yipdw> JesseW: you'll find at least one md5 collision in ArchiveBot data
[05:01] <yipdw> in particular I'm pretty sure we have a capture of http://natmchugh.blogspot.com/2014/10/images-with-colliding-md5-hash.html
[05:02] <yipdw> ah yes http://archive.fart.website/archivebot/viewer/job/eeumo
[05:02] <yipdw> oh wait MrRadar already posted that, go me for not reading
[05:02] * yipdw is digital native
[05:04] <xmc> curious how many bytes of duplicate files exist in IA
[05:04] <xmc> the question that everybody new here asks :P
[05:04] <yipdw> 1, for an appropriately sized byte
[05:05] <xmc> for greater efficiency in transmission, we will deliver first all of the 0 bits and then all of the 1 bits.
[05:05] <JesseW> xmc: already answered (imperfectly) on the census page: http://archiveteam.org/index.php?title=Internet_Archive_Census
[05:05] <xmc> o
[05:06] <xmc> 1PB or so
[05:06] <xmc> sounds about right
[05:06] <JesseW>  There are 22,596,286 files which are copies of other files. The  duplicate files take up 1.06PB of space. (Assuming all files with the  same MD5 are duplicates.)
[05:06] <yipdw> not bad
[05:06] <JesseW> well, there's a bunch more duplication -- that's just what was counted, there
[05:07] <xmc> aye
[05:07] <JesseW> AFAIK, everything is duplicated once, for backup
[05:07] <xmc> yep
[05:07] <JesseW> and some is duplicated in Egypt
[05:07] <xmc> one copy on each of two different nodes
[05:08] <JesseW> and lots of stuff is duplicated but broken up into files differently
[05:09] <JesseW> i.e. I know archivebot has hundreds of thousands of copies of some standard google javascript libraries -- none of which is counted there
[05:09] <JesseW> because it's all inside of WARCs that aren't identical in full
[05:09] <yipdw> don't petabyte if its end is bristling
[05:10] <JesseW> if a byte becomes infected, it may be necessary to kilabyte
[05:10] *** acridAxid has joined #archiveteam-bs
[05:12] <yipdw> I think it would be fun to do a comparison on the zillions of copies of jQuery
[05:12] <yipdw> group by version and see how many are actually identical, for increasingly sophisticated definitions of "identical"
[05:12] <yipdw> someone who wants to get into static analysis of Javascript might have some fun there
[05:13] <JesseW> lol
[05:13] <yipdw> of course the group-by part might be difficult
[05:13] <JesseW> yes, that could be rather entertaining
[05:14] <yipdw> like I think it'd be awesome if someone found a rootkit in jQuery that way
[05:14] * JesseW is glad my comparison script has finally made it to the "A"s...
[05:17] <yipdw> hey that's one way to address The Website Obesity Crisis
[05:18] <yipdw> er wait no it isn't nm
[05:19] <JesseW> suggestions for a shellscript to print one line per gigabyte in a file, efficiently?
[05:20] <JesseW> head -c is slow
[05:20] <yipdw> do you want to get the first line at each gigabyte boundary?
[05:20] <yipdw> or output something per GB processed
[05:20] <JesseW> the first
[05:21] <JesseW> I want to print the line just after (or before, doesn't matter) each gigabyte boundry
[05:21] <JesseW> (er, boundary)
[05:21] <JesseW> it's basically just seek -- I just don't remember an efficient way to do it from the commandline
[05:22] <yipdw> I don't either
[05:22] <yipdw> you might have better luck in python etc
[05:22] <yipdw> I know seek()/fseek exist there
[05:22] <JesseW> hm... /me pokes at it
[05:23] <yipdw> are you guaranteed that each line will start on a gigabyte boundary, or is some seeking to find end-of-line required
[05:24] <JesseW> not guaranteed
[05:25] <JesseW> but that's just read two lines, and discard the first
[05:25] <JesseW> as I said, I don't care about exactness, just "what's the general layout around here"
[05:26] <JesseW> ok, got it in python
[05:28] <JesseW> so the first gigabyte of the new hashes is all identifiers that start with a digit; it takes up till 5 GB to get to the B's
[05:29] <JesseW> 3 gigabytes of RECAP
[05:30] <JesseW> oddly, 3 gigabytes of playdrone-metadata ...?
[05:31] <JesseW> http://systems.cs.columbia.edu/projects/playdrone/
[05:41] <JesseW> and there's this one, from the million books project, that looks like it was an early effort, in a weird format: http://archive.org/details/TheMetallurgyOfLead -- it has 3097 files in it
[05:59] *** mutoso has quit IRC (Read error: Connection reset by peer)
[06:11] *** godane has joined #archiveteam-bs
[06:15] *** mutoso has joined #archiveteam-bs
[06:17] *** mutoso has quit IRC (Read error: Connection reset by peer)
[06:22] *** mutoso has joined #archiveteam-bs
[06:27] *** mutoso has quit IRC (Read error: Connection reset by peer)
[06:28] <xmc> JesseW: here's some numbers on deduping real-world warcs: https://www.taricorp.net/2016/web-history-warc
[06:32] *** mutoso has joined #archiveteam-bs
[06:34] <JesseW> neat
[06:34] *** mutoso has quit IRC (Read error: Connection reset by peer)
[06:34] <JesseW> Here's an item using a *third* form of "semi-private" https://archive.org/metadata/decom.accumulator03_archive_org.2.texts.TareeqDarbarEDelhi
[06:35] <JesseW> we're up to: is_dark: true, nodownload: true, private:true (on individual files) and delete.php
[06:36] <JesseW> plus whatever (presumably renaming) happened to the ~500 identifiers I can't find any history on
[06:36] <JesseW> I do like the "yet" in the error messages displayed for "nodownload: true"
[06:39] *** mutoso has joined #archiveteam-bs
[06:42] *** mutoso has quit IRC (Read error: Connection reset by peer)
[06:46] <yipdw> a while back I was wondering how I'd use all the extra memory prgmr gave me and now I have the answer
[06:47] <yipdw> "docker run"
[06:47] <yipdw> not that docker run is a hog in itself but it does make it pretty easy to launch complicated things
[06:47] *** mutoso has joined #archiveteam-bs
[06:51] *** mutoso has quit IRC (Read error: Connection reset by peer)
[06:59] *** mutoso has joined #archiveteam-bs
[07:03] *** mutoso has quit IRC (Read error: Connection reset by peer)
[07:13] *** mutoso has joined #archiveteam-bs
[07:26] *** mutoso has quit IRC (Read error: Connection reset by peer)
[07:26] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[07:29] *** BlueMaxim has joined #archiveteam-bs
[07:29] *** BlueMaxim has quit IRC (Connection closed)
[07:31] *** mutoso has joined #archiveteam-bs
[08:13] *** mutoso has quit IRC (Read error: Connection reset by peer)
[08:13] *** mutoso has joined #archiveteam-bs
[08:13] *** mutoso has quit IRC (Read error: Connection reset by peer)
[08:24] *** mutoso has joined #archiveteam-bs
[08:37] <godane> so i found a usenet site with nzb and nfo files
[08:38] <godane> i can mirror it by date too
[08:39] <JesseW> what are nzb and nfo files?
[08:40] <godane> nzb are the usenet files to download stuff
[08:40] <godane> nfo are the pirate notes
[08:41] <JesseW> hm, neat
[08:42] <JesseW> my census diff has made it to 's'
[08:42] <JesseW> having identified 5G of differences (the vast majority being private files excluded from the original one)
[08:46] <JesseW> Still, it has found over 58,000 changed md5 hashes...
[08:48] <JesseW> of which the vast majority are changes to metadata of various sorts
[08:49] *** wp494 has joined #archiveteam-bs
[08:53] <JesseW> found a few errors, like http://archive.org/metadata/2003-11-30.paf.sbd.wizard.23733.sbeok.flacf which somehow got its _files.xml 's format set to "Windows Media"...
[08:56] *** JesseW has quit IRC (Leaving.)
[10:11] *** mutoso has quit IRC (Read error: Connection reset by peer)
[10:17] *** mutoso has joined #archiveteam-bs
[10:57] *** mutoso has quit IRC (Read error: Connection reset by peer)
[11:02] *** mutoso has joined #archiveteam-bs
[11:12] *** mutoso has quit IRC (Read error: Connection reset by peer)
[11:13] *** vtyl has joined #archiveteam-bs
[11:17] *** lytv has quit IRC (Read error: Operation timed out)
[11:22] *** mutoso has joined #archiveteam-bs
[11:29] *** mutoso has quit IRC (Read error: Connection reset by peer)
[11:34] *** mutoso has joined #archiveteam-bs
[11:41] *** mutoso has quit IRC (Read error: Connection reset by peer)
[11:47] *** mutoso has joined #archiveteam-bs
[12:05] *** mutoso has quit IRC (Read error: Connection reset by peer)
[12:13] *** mutoso has joined #archiveteam-bs
[12:47] *** vitzli has joined #archiveteam-bs
[13:16] *** VADemon has joined #archiveteam-bs
[13:20] *** signius has quit IRC (Remote host closed the connection)
[13:24] *** signius has joined #archiveteam-bs
[14:18] *** mutoso has quit IRC (Read error: Connection reset by peer)
[14:29] *** mutoso has joined #archiveteam-bs
[14:31] *** mutoso has quit IRC (Read error: Connection reset by peer)
[14:36] *** mutoso has joined #archiveteam-bs
[14:36] *** mutoso has quit IRC (Read error: Connection reset by peer)
[14:42] *** mutoso has joined #archiveteam-bs
[14:42] *** mutoso has quit IRC (Read error: Connection reset by peer)
[14:53] *** mutoso has joined #archiveteam-bs
[14:54] *** mutoso has quit IRC (Read error: Connection reset by peer)
[15:17] *** mutoso has joined #archiveteam-bs
[15:35] *** JetBalsa has joined #archiveteam-bs
[15:41] <godane> i'm at 623k items now
[16:56] *** vitzli has quit IRC (Leaving)
[17:15] *** chazchaz has quit IRC (Read error: Operation timed out)
[17:20] *** chazchaz has joined #archiveteam-bs
[17:37] *** JesseW has joined #archiveteam-bs
[17:40] *** JesseW has quit IRC (Client Quit)
[18:34] *** VADemon has quit IRC (Read error: Operation timed out)
[18:36] <username1> this is a nice licensing/download page http://jvectormap.com/licenses-and-pricing/
[18:36] *** username1 is now known as schbirid
[18:54] <joepie91> schbirid: hrm.
[18:54] <joepie91> schbirid: it wrongly implies that the GPL is not for commercial use
[19:01] <schbirid> nah, i think it makes it convenient for people to pay if they want to use it commercially ;)
[19:21] <schbirid> fuck you wget, why can't you handle memory better
[19:21] <schbirid> i should just always use wpull...
[19:25] <schbirid> thanks debian https://pastee.org/u66db
[19:42] *** signius has quit IRC (Read error: Operation timed out)
[19:44] *** signius has joined #archiveteam-bs
[19:44] <joepie91> lol
[19:50] <Smiley> it relies on dumb managers
[19:50] <Smiley> can/t blame them tbh
[19:50] <midas> joepie91: guess who's back
[19:50] <midas> http://bash.org/
[19:51] <Smiley> omg grab
[19:56] <midas> archivebot is already on it
[19:58] <MrRadar> And was on it two months ago: http://archive.fart.website/archivebot/viewer/job/8hlot
[20:09] <yipdw> ugh
[20:09] <yipdw> "Run web traffic over HTTPS", they said
[20:09] <yipdw> Amazon Elastic Beanstalk does not support HTTPS-based status pings
[20:09] <yipdw> sih
[20:14] *** yipdw has quit IRC (Read error: Operation timed out)
[20:21] *** yipdw has joined #archiveteam-bs
[20:22] <joepie91> lol
[20:31] <SimpBrain> current rough friendsreunited group count, workplaces 268k, schools 114k, armed forces 10k, towns 25k
[20:35] <SimpBrain> teams 182k, friend groups 24k
[20:46] <SimpBrain> this is without any profile count
[20:46] <SimpBrain> schools will be a bigger grab since some schools have nearly 1k users attached
[20:46] <SimpBrain> and about the same in photos
[21:19] *** dashcloud has quit IRC (Read error: Operation timed out)
[21:21] *** schbirid has quit IRC (Quit: Leaving)
[21:23] *** dashcloud has joined #archiveteam-bs
[22:12] *** slyphic is now known as slyphic|a
[22:12] <ersi> yipdw: it'll be fun they said
[22:40] <joepie91> very dodgy, ransomware link at the top of bash.org: http://bash.org/?latest
[22:40] <joepie91> cc midas Smiley MrRadar
[22:53] <MrRadar> Yeah, I saw taht
[22:53] <MrRadar> It's really suspicious
[22:54] <MrRadar> It's the newest entry on the site
[22:54] <MrRadar> But it is also the highest-rated
[22:54] <MrRadar> Definitely reeks of spam
[23:12] *** RichardG has quit IRC (Ping timeout: 250 seconds)
[23:13] <Smiley> ??
[23:13] <Smiley> ah nice
[23:13] <Smiley> so possibly not really back lol
[23:14] *** RichardG has joined #archiveteam-bs
[23:15] <Smiley> ooo yus