#archiveteam-bs 2016-01-25,Mon

Time	Nickname	Message
00:38 ^🔗		wyatt8740 has quit IRC (Read error: Operation timed out)
00:39 ^🔗		wyatt8740 has joined #archiveteam-bs
00:48 ^🔗		kyan has joined #archiveteam-bs
00:51 ^🔗		godane has quit IRC (Read error: Operation timed out)
00:53 ^🔗	JesseW	OK, so now I have two sorted, 3-column tab-separated-value files -- of 15 and 25 gigabytes in size, respectively. I want to know which lines in them have been added, removed or changed. What's a sensible way to do this?
00:54 ^🔗	pikhq	comm
00:56 ^🔗	JesseW	pikhq: for a 25gigabyte file?
00:56 ^🔗	pikhq	comm, unlike diff, works line-by-line and assumes the inputs are sorted.
00:56 ^🔗	JesseW	hm, ok, will try it
01:01 ^🔗	JesseW	hm -- it does work quickly, and incrementally -- but I'm having some trouble figuring out how best to interpret the results
01:02 ^🔗	pikhq	The -1, -2, and -3 options might help. :)
01:02 ^🔗	JesseW	I've done comm -3 to remove the matching lines
01:02 ^🔗	pikhq	Mmkay.
01:02 ^🔗	JesseW	but it's still tricky to go from something like:
01:03 ^🔗	JesseW	0000000000002 0000000000002_archive.torrent 2f0355c11a6bc8dceffecbf7d46e5dce
01:03 ^🔗	JesseW	0000000000002 0000000000002_archive.torrent e929a012ec62c2ee021dfc5a71e749c2
01:03 ^🔗	JesseW	0000000000002 0000000000002_meta.xml 2c1c2c37230e5390b26651ebdf2b84c6
01:03 ^🔗	JesseW	0000000000002 0000000000002_meta.xml cbfcd60fecf27a9c71ab794fa8ebff74
01:03 ^🔗	JesseW	to noticing that the torrent and meta files have changed hashes
01:04 ^🔗	JesseW	I'd like to hack up a display of the above that made that more explicit...
01:04 ^🔗	pikhq	Hmm.
01:29 ^🔗		wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES)
01:37 ^🔗		acridAxid has joined #archiveteam-bs
01:39 ^🔗	JesseW	I think what I want for output is something like:
01:39 ^🔗	JesseW	0000000000002 0000000000002_meta.xml CHANGED
01:39 ^🔗	JesseW	0000000000002 thing.txt ADDED
01:39 ^🔗	JesseW	maybe I can do that with sed...
01:49 ^🔗		toad1 has joined #archiveteam-bs
01:50 ^🔗		toad2 has quit IRC (Read error: Operation timed out)
01:57 ^🔗		VADemon has quit IRC (Quit: left4dead)
02:15 ^🔗	dashcloud	JesseW: you're documenting this process (if only for yourself when you try to do it again) I hope?
02:15 ^🔗		acridAxid has quit IRC (Read error: Operation timed out)
02:19 ^🔗		username1 has joined #archiveteam-bs
02:22 ^🔗		schbirid2 has quit IRC (Read error: Operation timed out)
02:29 ^🔗	JesseW	dashcloud: more or less, yeah
02:31 ^🔗		brayden has joined #archiveteam-bs
02:42 ^🔗		acridAxid has joined #archiveteam-bs
03:42 ^🔗	JesseW	Yay, I've hacked up a sed script that gives the nicer display I wanted!
03:42 ^🔗	JesseW	comm --output-delimiter='!!!' -3 public-file-hashes_20150304205357_sorted.tsv public-file-hashes_20150304205357_recheck_20160120112813.tsv | head -n 100 | sed -ne $'/^!!!$/d\nN\ns/^!$[^\\t]$\\t$[^\\t]$\\t$[0-9a-f]$\\n!\\1\\t\\2\\t$[0-9a-f]$$/@ CHANGED\\t\\1\\t\\2 FROM \\4 TO \\3/\ns/^$[^!@][^\\t]\\t[^\\t]$.$/@ REMOVED\\t\\1/\ns/^!!!$[^\\t]\\t[^\\t]$.$/@ ADDED \\t\\1/\np\nD'
03:43 ^🔗	JesseW	enjoy the line noise...
03:46 ^🔗	yipdw	at some point python perl or ruby do begin to have more relevance
03:52 ^🔗	JesseW	sure. But sed is more fun (in some sense) :-)
03:53 ^🔗	JesseW	Once we get a regular schedule of censuses going on, I'll probably write a python program to do it.
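
A rough sketch of what such a Python program might look like, not JesseW's actual script. It assumes both files have exactly three tab-separated columns (identifier, filename, md5) and are sorted in an order that matches Python's string comparison (for example, produced with LC_ALL=C sort). The names parse and census_diff are made up for illustration. Like comm, it walks both inputs in step and keeps only one line from each in memory.

```python
def parse(line):
    """Split one TSV line into ((identifier, filename), md5)."""
    identifier, filename, md5 = line.rstrip("\n").split("\t")
    return (identifier, filename), md5


def census_diff(old_lines, new_lines):
    """Yield one report line per added, removed, or changed file.

    Both inputs must be iterables of lines sorted by (identifier, filename).
    """
    old_iter, new_iter = iter(old_lines), iter(new_lines)
    old, new = next(old_iter, None), next(new_iter, None)
    while old is not None or new is not None:
        old_key, old_md5 = parse(old) if old is not None else (None, None)
        new_key, new_md5 = parse(new) if new is not None else (None, None)
        if new is None or (old is not None and old_key < new_key):
            # Present only in the old file.
            yield "@ REMOVED\t" + "\t".join(old_key)
            old = next(old_iter, None)
        elif old is None or new_key < old_key:
            # Present only in the new file.
            yield "@ ADDED\t" + "\t".join(new_key)
            new = next(new_iter, None)
        else:
            # Same identifier and filename in both; report only if the hash differs.
            if old_md5 != new_md5:
                yield "@ CHANGED\t%s\t%s FROM %s TO %s" % (old_key + (old_md5, new_md5))
            old, new = next(old_iter, None), next(new_iter, None)
```

To run it over real files, pass two open file objects, e.g. census_diff(open(old_path), open(new_path)), and print each yielded line.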
04:01 ^🔗	JesseW	hm -- the previous census left out the wayback data (admittedly, there was a note to that effect). Interestingly, although it isn't downloadable, the hashes for the files are available -- so my census grabbed them all, and is now reporting ALLOFTHEM as new files. :-)
04:04 ^🔗	JesseW	I think it's about 10 GIGABYTES of METADATA. :-)
04:05 ^🔗	JesseW	hashes of wayback files, I mean.
04:07 ^🔗	phuzion	daaaaaaaang
04:20 ^🔗		acridAxid has quit IRC (Read error: Connection reset by peer)
04:29 ^🔗	JesseW	but it's really good that they make those hashes available, because that way, we can distribute them, and when someone comes to IA saying "secretly change this thing on the Wayback Machine or else" -- IA can point to the external distribution of the (reported) hashes and say, "sure, but it'll get discovered in a couple months, and then where will you be?"
04:30 ^🔗	JesseW	(admittedly, they could still falsify the reporting of the hashes -- but that would require manual changes to the code, and the equivalent of an accountant keeping 2 sets of books -- which in itself would be a lot more obvious to a whole lot more people inside IA)
04:34 ^🔗	JesseW	wow, there are over 600,000 identifiers on IA that start with a digit
04:38 ^🔗	JesseW	and there's about 1.6 GB of hash metadata only in my newer result in there.
04:38 ^🔗	phuzion	<phuzion> daaaaaaaang
04:40 ^🔗	MrRadar	JesseW: Does the IA provide secure hashes or just MD5? While that's a laudable goal MD5 is so broken these days that it provides almost no cryptographic security
04:41 ^🔗	JesseW	They report md5, sha1 and crc32, generally.
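
For anyone wanting to check a downloaded file against all three reported values, here is a minimal sketch that computes md5, sha1 and crc32 in a single pass using only the Python standard library. The function name file_digests and the output formatting (lowercase hex, crc32 zero-padded to 8 digits) are assumptions, not something taken from IA's metadata format.

```python
import hashlib
import zlib


def file_digests(path, chunk_size=1 << 20):
    """Return md5, sha1 and crc32 of a file, reading it once in chunks."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    crc = 0
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
            crc = zlib.crc32(chunk, crc)  # running CRC carried across chunks
    return {
        "md5": md5.hexdigest(),
        "sha1": sha1.hexdigest(),
        "crc32": "%08x" % (crc & 0xFFFFFFFF),
    }
```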
04:42 ^🔗	MrRadar	That's slightly better
04:42 ^🔗	JesseW	Unfortunately, the initial census only stored the md5, which is why I decided to follow that in this one.
04:42 ^🔗	MrRadar	Though SHA-1 is on death's doorstep
04:43 ^🔗	JesseW	And, given that much of this stuff is human-readable, I'm not so sure the breakage of md5 is as significant.
04:43 ^🔗	MrRadar	Yeah, finding human-readable junk to add is probably harder than finding any junk to fudge the checksum
04:43 ^🔗	MrRadar	But I would still like to see non-fudgeable checksums to begin with
04:43 ^🔗	MrRadar	But thanks for doing this work
04:44 ^🔗	JesseW	Or has it gotten broken enough that say, one can take an image of Stalin & Trotsky, remove Trotsky, and end up with the same md5 hash?
04:44 ^🔗	espes___	fuck no
04:44 ^🔗	MrRadar	To that point, yes: http://natmchugh.blogspot.com/2014/10/how-i-created-two-images-with-same-md5.html
04:44 ^🔗	JesseW	MrRadar: I'm just taking it to the next step -- Jake at IA did the original work.
04:45 ^🔗	JesseW	heh, wow
04:45 ^🔗	espes___	you can generate colliding data, but that's far from forcing a hash
04:45 ^🔗	espes___	that's just a trick and only requires one colliding block
04:46 ^🔗	JesseW	yeah, I saw that it sneaks the necessary junk in what is effectively a comment
04:48 ^🔗	JesseW	but most formats do have comments...
04:49 ^🔗	espes___	again, finding a collision != finding a preimage
04:50 ^🔗	JesseW	yep
04:55 ^🔗	JesseW	here's a perfectly innocuous change: https://catalogd.archive.org/log/423306379 -- which was detected by my check
04:56 ^🔗	JesseW	@ CHANGED 02196788.1207.emory.edu 02196788.1207.emory.edu_meta.xml FROM b3db8dd19bc7230af632e5ac02a5e41c TO 20604ea11a6ec1266c5c4a7fa8d3d500
05:00 ^🔗	yipdw	JesseW: you'll find at least one md5 collision in ArchiveBot data
05:01 ^🔗	yipdw	in particular I'm pretty sure we have a capture of http://natmchugh.blogspot.com/2014/10/images-with-colliding-md5-hash.html
05:02 ^🔗	yipdw	ah yes http://archive.fart.website/archivebot/viewer/job/eeumo
05:02 ^🔗	yipdw	oh wait MrRadar already posted that, go me for not reading
05:02 ^🔗	*	yipdw is digital native
05:04 ^🔗	xmc	curious how many bytes of duplicate files exist in IA
05:04 ^🔗	xmc	the question that everybody new here asks :P
05:04 ^🔗	yipdw	1, for an appropriately sized byte
05:05 ^🔗	xmc	for greater efficiency in transmission, we will deliver first all of the 0 bits and then all of the 1 bits.
05:05 ^🔗	JesseW	xmc: already answered (imperfectly) on the census page: http://archiveteam.org/index.php?title=Internet_Archive_Census
05:05 ^🔗	xmc	o
05:06 ^🔗	xmc	1PB or so
05:06 ^🔗	xmc	sounds about right
05:06 ^🔗	JesseW	There are 22,596,286 files which are copies of other files. The duplicate files take up 1.06PB of space. (Assuming all files with the same MD5 are duplicates.)
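
One plausible way to arrive at counts like these, shown as a sketch and not as the census's actual method: treat the first file seen with a given MD5 as the original and every later one as a copy. The input shape (pairs of md5 and size in bytes) and the name duplicate_stats are assumptions for illustration.

```python
def duplicate_stats(records):
    """Count duplicate files and the bytes they occupy.

    records: iterable of (md5, size_in_bytes) pairs.
    Returns (number_of_copies, bytes_used_by_copies), where the first file
    seen for each md5 is treated as the original and not counted.
    """
    seen = set()
    copies = 0
    wasted = 0
    for md5, size in records:
        if md5 in seen:
            copies += 1
            wasted += size
        else:
            seen.add(md5)
    return copies, wasted
```

Holding every MD5 in a set is memory-hungry at census scale; if the input is sorted by MD5, comparing each record with the previous one gives the same totals without the set.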
05:06 ^🔗	yipdw	not bad
05:06 ^🔗	JesseW	well, there's a bunch more duplication -- that's just what was counted, there
05:07 ^🔗	xmc	aye
05:07 ^🔗	JesseW	AFAIK, everything is duplicated once, for backup
05:07 ^🔗	xmc	yep
05:07 ^🔗	JesseW	and some is duplicated in Egypt
05:07 ^🔗	xmc	one copy on each of two different nodes
05:08 ^🔗	JesseW	and lots of stuff is duplicated but broken up into files differently
05:09 ^🔗	JesseW	i.e. I know archivebot has hundreds of thousands of copies of some standard google javascript libraries -- none of which is counted there
05:09 ^🔗	JesseW	because it's all inside of WARCs that aren't identical in full
05:09 ^🔗	yipdw	don't petabyte if its end is bristling
05:10 ^🔗	JesseW	if a byte becomes infected, it may be necessary to kilabyte
05:10 ^🔗		acridAxid has joined #archiveteam-bs
05:12 ^🔗	yipdw	I think it would be fun to do a comparison on the zillions of copies of jQuery
05:12 ^🔗	yipdw	group by version and see how many are actually identical, for increasingly sophisticated definitions of "identical"
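
A small sketch of the first two rungs of that ladder: byte-for-byte identical, then identical after collapsing whitespace. The helper names (exact_key, whitespace_key, group_copies) are hypothetical, and the whitespace rule is deliberately crude: it also collapses whitespace inside string literals, so it is a looser notion of "identical", not a semantic one.

```python
import hashlib
from collections import defaultdict


def exact_key(data):
    """Key for byte-for-byte identity."""
    return hashlib.sha256(data).hexdigest()


def whitespace_key(data):
    """Key that ignores differences in whitespace and line endings."""
    collapsed = b" ".join(data.split())  # split() with no args splits on ASCII whitespace
    return hashlib.sha256(collapsed).hexdigest()


def group_copies(copies, key):
    """Group names whose contents share the same key.

    copies: dict mapping a name to the file's bytes.
    Returns a sorted list of sorted name lists, one list per group.
    """
    groups = defaultdict(list)
    for name, data in copies.items():
        groups[key(data)].append(name)
    return sorted(sorted(names) for names in groups.values())
```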
05:12 ^🔗	yipdw	someone who wants to get into static analysis of Javascript might have some fun there
05:13 ^🔗	JesseW	lol
05:13 ^🔗	yipdw	of course the group-by part might be difficult
05:13 ^🔗	JesseW	yes, that could be rather entertaining
05:14 ^🔗	yipdw	like I think it'd be awesome if someone found a rootkit in jQuery that way
05:14 ^🔗	*	JesseW is glad my comparison script has finally made it to the "A"s...
05:17 ^🔗	yipdw	hey that's one way to address The Website Obesity Crisis
05:18 ^🔗	yipdw	er wait no it isn't nm
05:19 ^🔗	JesseW	suggestions for a shellscript to print one line per gigabyte in a file, efficiently?
05:20 ^🔗	JesseW	head -c is slow
05:20 ^🔗	yipdw	do you want to get the first line at each gigabyte boundary?
05:20 ^🔗	yipdw	or output something per GB processed
05:20 ^🔗	JesseW	the first
05:21 ^🔗	JesseW	I want to print the line just after (or before, doesn't matter) each gigabyte boundry
05:21 ^🔗	JesseW	(er, boundary)
05:21 ^🔗	JesseW	it's basically just seek -- I just don't remember an efficient way to do it from the commandline
05:22 ^🔗	yipdw	I don't either
05:22 ^🔗	yipdw	you might have better luck in python etc
05:22 ^🔗	yipdw	I know seek()/fseek exist there
05:22 ^🔗	JesseW	hm... /me pokes at it
05:23 ^🔗	yipdw	are you guaranteed that each line will start on a gigabyte boundary, or is some seeking to find end-of-line required
05:24 ^🔗	JesseW	not guaranteed
05:25 ^🔗	JesseW	but that's just read two lines, and discard the first
05:25 ^🔗	JesseW	as I said, I don't care about exactness, just "what's the general layout around here"
05:26 ^🔗	JesseW	ok, got it in python
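
The script itself was not pasted into the channel. A minimal sketch of the approach described above (seek to each boundary, read two lines, discard the first) could look like this; the function name lines_at_boundaries and the step parameter are illustrative assumptions.

```python
def lines_at_boundaries(path, step=1 << 30):
    """Yield (offset, line) for the first complete line after each multiple of step.

    step defaults to one gigabyte (2**30 bytes). Only a couple of lines are
    read per boundary, so this stays fast even on very large files.
    """
    with open(path, "rb") as handle:
        offset = step
        while True:
            handle.seek(offset)
            handle.readline()          # discard the (probably partial) line we landed in
            line = handle.readline()   # the next complete line
            if not line:
                break                  # ran past the end of the file
            yield offset, line.rstrip(b"\n").decode("utf-8", "replace")
            offset += step
```

If a boundary happens to fall exactly on the start of a line, that line is skipped and the following one is reported, which is fine when only the general layout matters.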
05:28 ^🔗	JesseW	so the first gigabyte of the new hashes is all identifiers that start with a digit; it takes up till 5 GB to get to the B's
05:29 ^🔗	JesseW	3 gigabytes of RECAP
05:30 ^🔗	JesseW	oddly, 3 gigabytes of playdrone-metadata ...?
05:31 ^🔗	JesseW	http://systems.cs.columbia.edu/projects/playdrone/
05:41 ^🔗	JesseW	and there's this one, from the million books project, that looks like it was an early effort, in a weird format: http://archive.org/details/TheMetallurgyOfLead -- it has 3097 files in it
05:59 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
06:11 ^🔗		godane has joined #archiveteam-bs
06:15 ^🔗		mutoso has joined #archiveteam-bs
06:17 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
06:22 ^🔗		mutoso has joined #archiveteam-bs
06:27 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
06:28 ^🔗	xmc	JesseW: here's some numbers on deduping real-world warcs: https://www.taricorp.net/2016/web-history-warc
06:32 ^🔗		mutoso has joined #archiveteam-bs
06:34 ^🔗	JesseW	neat
06:34 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
06:34 ^🔗	JesseW	Here's an item using a third form of "semi-private" https://archive.org/metadata/decom.accumulator03_archive_org.2.texts.TareeqDarbarEDelhi
06:35 ^🔗	JesseW	we're up to: is_dark: true, nodownload: true, private:true (on individual files) and delete.php
06:36 ^🔗	JesseW	plus whatever (presumably renaming) happened to the ~500 identifiers I can't find any history on
06:36 ^🔗	JesseW	I do like the "yet" in the error messages displayed for "nodownload: true"
06:39 ^🔗		mutoso has joined #archiveteam-bs
06:42 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
06:46 ^🔗	yipdw	a while back I was wondering how I'd use all the extra memory prgmr gave me and now I have the answer
06:47 ^🔗	yipdw	"docker run"
06:47 ^🔗	yipdw	not that docker run is a hog in itself but it does make it pretty easy to launch complicated things
06:47 ^🔗		mutoso has joined #archiveteam-bs
06:51 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
06:59 ^🔗		mutoso has joined #archiveteam-bs
07:03 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
07:13 ^🔗		mutoso has joined #archiveteam-bs
07:26 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
07:26 ^🔗		BlueMaxim has quit IRC (Read error: Connection reset by peer)
07:29 ^🔗		BlueMaxim has joined #archiveteam-bs
07:29 ^🔗		BlueMaxim has quit IRC (Connection closed)
07:31 ^🔗		mutoso has joined #archiveteam-bs
08:13 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
08:13 ^🔗		mutoso has joined #archiveteam-bs
08:13 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
08:24 ^🔗		mutoso has joined #archiveteam-bs
08:37 ^🔗	godane	so i found a usenet site with nzb and nfo files
08:38 ^🔗	godane	i can mirror it by date too
08:39 ^🔗	JesseW	what are nzb and nfo files?
08:40 ^🔗	godane	nzb are the usenet files to download stuff
08:40 ^🔗	godane	nfo are the pirate notes
08:41 ^🔗	JesseW	hm, neat
08:42 ^🔗	JesseW	my census diff has made it to 's'
08:42 ^🔗	JesseW	having identified 5G of differences (the vast majority being private files excluded from the original one)
08:46 ^🔗	JesseW	Still, it has found over 58,000 changed md5 hashes...
08:48 ^🔗	JesseW	of which the vast majority are changes to metadata of various sorts
08:49 ^🔗		wp494 has joined #archiveteam-bs
08:53 ^🔗	JesseW	found a few errors, like http://archive.org/metadata/2003-11-30.paf.sbd.wizard.23733.sbeok.flacf which somehow got its _files.xml 's format set to "Windows Media"...
08:56 ^🔗		JesseW has quit IRC (Leaving.)
10:11 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
10:17 ^🔗		mutoso has joined #archiveteam-bs
10:57 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
11:02 ^🔗		mutoso has joined #archiveteam-bs
11:12 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
11:13 ^🔗		vtyl has joined #archiveteam-bs
11:17 ^🔗		lytv has quit IRC (Read error: Operation timed out)
11:22 ^🔗		mutoso has joined #archiveteam-bs
11:29 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
11:34 ^🔗		mutoso has joined #archiveteam-bs
11:41 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
11:47 ^🔗		mutoso has joined #archiveteam-bs
12:05 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
12:13 ^🔗		mutoso has joined #archiveteam-bs
12:47 ^🔗		vitzli has joined #archiveteam-bs
13:16 ^🔗		VADemon has joined #archiveteam-bs
13:20 ^🔗		signius has quit IRC (Remote host closed the connection)
13:24 ^🔗		signius has joined #archiveteam-bs
14:18 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
14:29 ^🔗		mutoso has joined #archiveteam-bs
14:31 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
14:36 ^🔗		mutoso has joined #archiveteam-bs
14:36 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
14:42 ^🔗		mutoso has joined #archiveteam-bs
14:42 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
14:53 ^🔗		mutoso has joined #archiveteam-bs
14:54 ^🔗		mutoso has quit IRC (Read error: Connection reset by peer)
15:17 ^🔗		mutoso has joined #archiveteam-bs
15:35 ^🔗		JetBalsa has joined #archiveteam-bs
15:41 ^🔗	godane	i'm at 623k items now
16:56 ^🔗		vitzli has quit IRC (Leaving)
17:15 ^🔗		chazchaz has quit IRC (Read error: Operation timed out)
17:20 ^🔗		chazchaz has joined #archiveteam-bs
17:37 ^🔗		JesseW has joined #archiveteam-bs
17:40 ^🔗		JesseW has quit IRC (Client Quit)
18:34 ^🔗		VADemon has quit IRC (Read error: Operation timed out)
18:36 ^🔗	username1	this is a nice licensing/download page http://jvectormap.com/licenses-and-pricing/
18:36 ^🔗		username1 is now known as schbirid
18:54 ^🔗	joepie91	schbirid: hrm.
18:54 ^🔗	joepie91	schbirid: it wrongly implies that the GPL is not for commercial use
19:01 ^🔗	schbirid	nah, i think it makes it convenient for people to pay if they want to use it commercially ;)
19:21 ^🔗	schbirid	fuck you wget, why can't you handle memory better
19:21 ^🔗	schbirid	i should just always use wpull...
19:25 ^🔗	schbirid	thanks debian https://pastee.org/u66db
19:42 ^🔗		signius has quit IRC (Read error: Operation timed out)
19:44 ^🔗		signius has joined #archiveteam-bs
19:44 ^🔗	joepie91	lol
19:50 ^🔗	Smiley	it relies on dumb managers
19:50 ^🔗	Smiley	can't blame them tbh
19:50 ^🔗	midas	joepie91: guess who's back
19:50 ^🔗	midas	http://bash.org/
19:51 ^🔗	Smiley	omg grab
19:56 ^🔗	midas	archivebot is already on it
19:58 ^🔗	MrRadar	And was on it two months ago: http://archive.fart.website/archivebot/viewer/job/8hlot
20:09 ^🔗	yipdw	ugh
20:09 ^🔗	yipdw	"Run web traffic over HTTPS", they said
20:09 ^🔗	yipdw	Amazon Elastic Beanstalk does not support HTTPS-based status pings
20:09 ^🔗	yipdw	sigh
20:14 ^🔗		yipdw has quit IRC (Read error: Operation timed out)
20:21 ^🔗		yipdw has joined #archiveteam-bs
20:22 ^🔗	joepie91	lol
20:31 ^🔗	SimpBrain	current rough friendsreunited group count, workplaces 268k, schools 114k, armed forces 10k, towns 25k
20:35 ^🔗	SimpBrain	teams 182k, friend groups 24k
20:46 ^🔗	SimpBrain	this is without any profile count
20:46 ^🔗	SimpBrain	schools will be a bigger grab since some schools have nearly 1k users attached
20:46 ^🔗	SimpBrain	and about the same in photos
21:19 ^🔗		dashcloud has quit IRC (Read error: Operation timed out)
21:21 ^🔗		schbirid has quit IRC (Quit: Leaving)
21:23 ^🔗		dashcloud has joined #archiveteam-bs
22:12 ^🔗		slyphic is now known as slyphic|a
22:12 ^🔗	ersi	yipdw: it'll be fun they said
22:40 ^🔗	joepie91	very dodgy, ransomware link at the top of bash.org: http://bash.org/?latest
22:40 ^🔗	joepie91	cc midas Smiley MrRadar
22:53 ^🔗	MrRadar	Yeah, I saw that
22:53 ^🔗	MrRadar	It's really suspicious
22:54 ^🔗	MrRadar	It's the newest entry on the site
22:54 ^🔗	MrRadar	But it is also the highest-rated
22:54 ^🔗	MrRadar	Definitely reeks of spam
23:12 ^🔗		RichardG has quit IRC (Ping timeout: 250 seconds)
23:13 ^🔗	Smiley	??
23:13 ^🔗	Smiley	ah nice
23:13 ^🔗	Smiley	so possibly not really back lol
23:14 ^🔗		RichardG has joined #archiveteam-bs
23:15 ^🔗	Smiley	ooo yus
