#archiveteam 2013-11-01,Fri


Time Nickname Message
00:14 πŸ”— balrog I'd like to find a copy of the magazine containing the type-in code for http://www.worldofspectrum.org/infoseekid.cgi?id=0003101
00:14 πŸ”— odie5533 What happens to the Warc info records when you concatenate two warc files?
00:14 πŸ”— balrog (it's a German magazine called Happy Computer, this was "ZX Spectrum Sonderheft 1")
00:16 πŸ”— ivan` odie5533: nothing? I assume IA can import concatenated warcs since they import megawarcs
00:22 πŸ”— odie5533 ivan`: I am thinking of creating a warc file for every record, so I wasn't sure if I should leave off warc info records or not.
00:22 πŸ”— odie5533 probably should
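For context on the concatenation and warcinfo questions above: records in a .warc.gz are conventionally compressed one gzip member per record, so two such files can be concatenated byte-for-byte into a valid WARC, and each constituent file can keep its own warcinfo record (a warcinfo record is read as describing the records that follow it). A minimal Python sketch, with illustrative file names:

    # Each record in a .warc.gz is its own gzip member, so plain byte
    # concatenation of per-record-compressed files yields a valid combined WARC.
    with open("combined.warc.gz", "wb") as out:
        for part in ("a.warc.gz", "b.warc.gz"):
            with open(part, "rb") as src:
                out.write(src.read())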
00:26 πŸ”— odie5533 Anyone here write tools for handling warc files and know the format?
00:29 πŸ”— dashcloud probably the best resource: http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem
00:29 πŸ”— dashcloud tons of info on the format and the tools
00:29 πŸ”— godane 2011 and 2012 episodes of the Engadget podcast are backed up
00:29 πŸ”— odie5533 I have the spec, I'm just wanting someone to bounce ideas off of.
00:30 πŸ”— odie5533 Does tef come around here much?
00:31 πŸ”— dashcloud odie5533: also on the page are a number of tools implementing WARC or doing something to WARC files (lots of source code as well)
00:32 πŸ”— odie5533 dashcloud: yeah, I know. I've written a few of them. But I still feel lonely and want someone to talk to.
07:16 πŸ”— Nemo_bis So, are we done with blip.tv? http://tracker.archiveteam.org/bloopertv/ The wiki page claims there are tenfold more users...
08:12 πŸ”— Cameron_D we stopped to de-duplicate things, then never started again, I think.
08:12 πŸ”— Lord_Nigh why not start again and try to split the work to reduce potential duplication?
08:52 πŸ”— ersi odie5533: Not so much any more, I think he's pretty busy with CodeClub
08:52 πŸ”— ersi which isn't related to archiving :) his ex-work was highly archive related ;)
09:12 πŸ”— godane so i just found candywrappermuseum.com
09:13 πŸ”— godane i'm mirroring it right now
09:37 πŸ”— Lord_Nigh did anyone ever archive http://www.pica-pic.com/ ? it's kinda tricky since each flash game downloads its assets separately from within the swf file
09:39 πŸ”— w0rp I hope Flash will be remembered as being a really bad idea.
09:40 πŸ”— w0rp (Although it emerged from something people actually want: the ability to download and run applications without being plagued by malware all of the time.)
10:13 πŸ”— w0rp Is there a recommended combination of options for, "Please give me this whole site in WARC" for wget?
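A commonly used combination for this is wget's WARC options (available in wget 1.14 and later). A hedged sketch, expressed here as a Python subprocess call with a placeholder URL and output prefix:

    import subprocess

    # Recursive mirror with page requisites, writing example-site.warc.gz plus a
    # CDX index alongside the downloaded files; URL and prefix are placeholders.
    subprocess.run([
        "wget",
        "--mirror",                  # recursive download with timestamping
        "--page-requisites",         # also fetch CSS/JS/images pages need to render
        "--warc-file=example-site",  # WARC output prefix (.warc.gz is appended)
        "--warc-cdx",                # write a CDX index for the WARC
        "-e", "robots=off",          # archive crawls typically ignore robots.txt
        "--wait=1",                  # pause between requests
        "http://example.com/",
    ], check=True)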
11:55 πŸ”— odie5533 ersi: hrm. Do you do warc coding?
12:15 πŸ”— godane so i need some help tracking down old techtv stuff
12:17 πŸ”— godane it's starting to look like call for help canada may be more pirated than i thought
12:18 πŸ”— godane i wonder if any of you guys know of private torrent sites that collect this sort of stuff
12:18 πŸ”— godane Famicoman: could you help me with this?
12:22 πŸ”— ersi odie5533: Yes/No.
12:24 πŸ”— odie5533 ersi: what does that mean?
12:25 πŸ”— ersi It means exactly that.
12:25 πŸ”— ersi But.. Just ask what you're wondering about WARC instead of asking if someone can look at a question
12:26 πŸ”— odie5533 well I was thinking about how warc writing works, and I was thinking it probably uses a lot of RAM because it seems to store an entire record in memory before writing it out, so it can determine the Content-Length for the WARC record header.
12:26 πŸ”— odie5533 Additionally, if you are downloading a website, only one record can be written at a time since it writes to a single file.
12:27 πŸ”— odie5533 Both of these seem to me to make it difficult to download large files/websites using a single warc file as output, so I was considering using multiple warc files as output.
12:27 πŸ”— odie5533 And outputting to the file as data is received, then going back later to determine the length of the record.
12:28 πŸ”— odie5533 I believe this would lower RAM usage, especially if the WARC downloader is receiving a large file.
12:29 πŸ”— odie5533 I was wondering if someone else had considered this, and if they believe it was a problem that even needed solving
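One way to get the low-memory behaviour described above, written as a rough Python illustration rather than any existing tool's code: spool the response payload to a temporary file while it streams in, which yields the Content-Length for the record header without either holding the whole body in RAM or seeking back to patch the header. The helper name and its arguments are hypothetical.

    import shutil
    import tempfile
    import uuid
    from datetime import datetime, timezone

    def write_response_record(warc_out, target_uri, payload_stream, chunk_size=64 * 1024):
        # Spool the HTTP payload to disk in chunks so memory use stays bounded
        # regardless of how large the downloaded file is.
        with tempfile.TemporaryFile() as spool:
            length = 0
            while True:
                chunk = payload_stream.read(chunk_size)
                if not chunk:
                    break
                spool.write(chunk)
                length += len(chunk)

            # With the block length known, the WARC record headers can be written
            # up front, followed by a straight copy of the spooled payload.
            headers = (
                "WARC/1.0\r\n"
                "WARC-Type: response\r\n"
                f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
                f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}\r\n"
                f"WARC-Target-URI: {target_uri}\r\n"
                "Content-Type: application/http; msgtype=response\r\n"
                f"Content-Length: {length}\r\n"
                "\r\n"
            )
            warc_out.write(headers.encode("utf-8"))
            spool.seek(0)
            shutil.copyfileobj(spool, warc_out)
            warc_out.write(b"\r\n\r\n")  # two CRLFs terminate a WARC record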
12:31 πŸ”— ersi I know https://github.com/internetarchive/liveweb does that. They use an output file per thread, if I'm not mistaken
12:33 πŸ”— odie5533 that might just be out of convenience for their thread model
12:34 πŸ”— odie5533 since it's sort of easier to output 1 file per thread rather than writing a message passer to handle output
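The one-file-per-thread pattern mentioned here can be sketched roughly as below (a hypothetical helper, not liveweb's actual code): each worker thread lazily opens its own WARC output file, so records from different threads never interleave and no message-passing writer is needed.

    import threading

    _thread_local = threading.local()

    def get_warc_output(prefix="crawl"):
        # Hypothetical helper: each thread gets (and reuses) its own output file,
        # named after the thread, so no locking is needed around record writes.
        if not hasattr(_thread_local, "warc_out"):
            name = "{}-{}.warc.gz".format(prefix, threading.current_thread().name)
            _thread_local.warc_out = open(name, "wb")
        return _thread_local.warc_out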
12:37 πŸ”— ersi I don't think it's an all that common use case when crawling sites. But yes, big files can wreak havoc with a crawl like that.. AFAIK the problem with at least `wget` is that its internal processing of URLs/location tree eats a lot of memory
12:38 πŸ”— odie5533 my current go-to crawler is Scrapy, and afaik it doesn't have that problem, but I've not tried it with quite as many urls as people have put wget through
12:38 πŸ”— odie5533 my thought is that even a 50 or 100 MB file coming down is going to then eat up 50 - 100 MB of memory.
12:39 πŸ”— ersi kind of negligible these days though
12:39 πŸ”— odie5533 well, it scales with whatever file size you name
12:40 πŸ”— odie5533 but, yes, it might well be a non-problem which is what I'm wondering.
12:40 πŸ”— ersi Yeah, of course
12:41 πŸ”— odie5533 I'm leaning towards non-problem at this point. Though my VPS does have limited RAM.
12:41 πŸ”— ersi It would be nice to have something that can handle big/bigger files on 'less RAM' though
12:41 πŸ”— odie5533 so I thought it would be nice to have a low-RAM downloader.
12:44 πŸ”— odie5533 one drawback is it would require extra read/writes to the disk, both to merge the warcs and to determine the content length.
12:47 πŸ”— odie5533 it's also significantly more complicated to write.
14:34 πŸ”— balrog for reference: http://archive.is/lfJSs (Toyota embedded software issues)
14:44 πŸ”— deathy trial transcript link sends to non-existing dropbox file :|
14:47 πŸ”— balrog deathy: mirror: http://cybergibbons.com/wp-content/uploads/2013/10/Bookout_v_Toyota_Barr_REDACTED.pdf
14:48 πŸ”— deathy thanks
16:12 πŸ”— phillipsj I think many VPSs allow you to "burst" RAM usage. Not sure how much that helps for downloading the Internet.
16:14 πŸ”— ersi That's irrelevant though
16:15 πŸ”— yipdw FWIW, wget does not buffer downloaded data in memory
16:15 πŸ”— yipdw the main memory usage appears to be what ersi stated
16:19 πŸ”— yipdw phillipsj: not too much -- when you've got a large wget job, you're going to be using a lot of RAM for a while
16:19 πŸ”— yipdw by "large" I mean "hundreds of thousands of URLs"
16:19 πŸ”— ersi like, for a really long time.
16:19 πŸ”— ersi Especially for a large site :)
16:19 πŸ”— yipdw phillipsj: it could be some other contributor; I don't think anyone here has actually profiled wget's memory behavior
16:20 πŸ”— * yipdw should at some point
16:20 πŸ”— ersi I know alard has somewhat
16:20 πŸ”— ersi Since he fixed a couple of memleaks
16:20 πŸ”— yipdw oh, yeah
16:21 πŸ”— yipdw actually
16:21 πŸ”— yipdw damn
16:21 πŸ”— yipdw now I wish ArchiveBot kept max wget memory usage in its job stats
16:22 πŸ”— * yipdw makes an issue
16:25 πŸ”— phillipsj some things just don't come up :)
16:27 πŸ”— ersi hm?
16:28 πŸ”— phillipsj It's not in the stats because nobody mentioned it, presumably.
16:30 πŸ”— yipdw no, I just never wrote the code to record it
16:31 πŸ”— yipdw it's been known as an issue for a while but for some reason I was like "huh, ArchiveBot has 5,000 jobs worth of history"
16:31 πŸ”— yipdw then it was like "oh fuck me"
16:31 πŸ”— yipdw :P
19:08 πŸ”— lemonkey http://www.theverge.com/2013/11/1/5052440/youtube-live-a-disastrous-spectacle-google-would-like-you-to-forget
19:08 πŸ”— lemonkey choice quote: "Frattini blames a two-year licensing contract, saying the event's videos were never meant to stay online for longer than a few years in the first place. But it turns out the conventional wisdom — that whatever you do will stay online forever — can actually be avoided when you're the people who make the internet."
21:03 πŸ”— Lord_Nigh http://64scener.com/ will shut down sometime within the next 12 months
21:51 πŸ”— w0rp Hmm, my warrior is getting "no item received" pretty consistently for the blip.tv project.
22:00 πŸ”— yipdw w0rp: there's nothing in the queue
23:01 πŸ”— balrog looking for Canon Canofile software... anyone have any idea where to find it?
23:05 πŸ”— balrog would be helpful for NeXT MO related stuff
23:21 πŸ”— odie5533 What is Next mo?
23:23 πŸ”— odie5533 balrog: ^
23:24 πŸ”— balrog magneto-optical disc for the NeXT Computer / NeXT Cubs
23:24 πŸ”— balrog Cube*
23:26 πŸ”— odie5533 Do you have a NeXT computer?
23:26 πŸ”— balrog yes
23:26 πŸ”— balrog I have a cube and a slab
23:26 πŸ”— odie5533 woah
23:26 πŸ”— odie5533 that thing is ancient
23:28 πŸ”— odie5533 it's a giant cube
