[00:14] I'd like to find a copy of the magazine containing the type-in code for http://www.worldofspectrum.org/infoseekid.cgi?id=0003101
[00:14] What happens to the warcinfo records when you concatenate two WARC files?
[00:14] (it's a German magazine called Happy Computer; this was "ZX Spectrum Sonderheft 1")
[00:16] odie5533: nothing? I assume IA can import concatenated WARCs since they import megawarcs
[00:22] ivan`: I am thinking of creating a WARC file for every record, so I wasn't sure if I should leave off warcinfo records or not.
[00:22] probably should
[00:26] Anyone here write tools for handling WARC files and know the format?
[00:29] probably the best resource: http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem
[00:29] tons of info on the format and the tools
[00:29] 2011 and 2012 episodes of the Engadget podcast are backed up
[00:29] I have the spec, I just want someone to bounce ideas off of.
[00:30] Does tef come around here much?
[00:31] odie5533: also on the page are a number of tools implementing WARC or doing something with WARC files (lots of source code as well)
[00:32] dashcloud: yeah, I know. I've written a few of them. But I still feel lonely and want someone to talk to.
[07:16] So, are we done with blip.tv? http://tracker.archiveteam.org/bloopertv/ The wiki page claims there are tenfold more users...
[08:12] we stopped to de-duplicate things, then never started again, I think.
[08:12] why not start again and try to split the work to reduce potential duplication?
[08:52] odie5533: Not so much any more, I think he's pretty busy with CodeClub
[08:52] which isn't related to archiving :) his previous work was highly archive-related ;)
[09:12] so i just found candywrappermuseum.com
[09:13] i'm mirroring it right now
[09:37] did anyone ever archive http://www.pica-pic.com/ ? it's kinda tricky since each Flash game downloads its assets separately from within the SWF file
[09:39] I hope Flash will be remembered as being a really bad idea.
[09:40] (Although it emerged from something people actually want: the ability to download and run applications without being plagued by malware all of the time.)
[10:13] Is there a recommended combination of wget options for "please give me this whole site in WARC"?
[11:55] ersi: hrm. Do you do WARC coding?
[12:15] so i need some help tracking down old TechTV stuff
[12:17] it's starting to look like Call for Help Canada may be more pirated than i thought
[12:18] i wonder if any of you guys know of private torrent sites that collect this sort of stuff
[12:18] Famicoman: could you help me with this?
[12:22] odie5533: Yes/No.
[12:24] ersi: what does that mean?
[12:25] It means exactly that.
[12:25] But... just ask what you're wondering about WARC instead of asking if someone can look at a question
[12:26] well, I was thinking about how WARC writing works, and it probably uses a lot of RAM because it seems to store an entire record in memory before writing it out, so it can compute the Content-Length for the WARC record header.
[12:26] Additionally, if you are downloading a website, only one record can be written at a time since it writes to a single file.
[12:27] Both of these seem to me to make it difficult to download large files/websites using a single WARC file as output, so I was considering using multiple WARC files as output.
[12:27] And writing to the file as data is received, then going back later to fill in the length of the record.
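A minimal sketch of that approach, assuming Python: stream the payload to a temporary spool file and only write the WARC record header once the length is known. The write_response_record helper and the header fields shown are illustrative, not code from any existing tool; a real writer would also emit request and warcinfo records, payload digests, and per-record gzip.

```python
import shutil
import tempfile
import uuid
from datetime import datetime, timezone


def write_response_record(warc_file, target_uri, body_chunks):
    """Write one WARC response record without holding the payload in RAM.

    body_chunks is any iterable of bytes, e.g. a streaming HTTP response.
    """
    with tempfile.TemporaryFile() as spool:
        length = 0
        for chunk in body_chunks:            # stream to disk as data arrives
            spool.write(chunk)
            length += len(chunk)

        # Only now is the Content-Length known, so the header can be written.
        header = (
            "WARC/1.0\r\n"
            "WARC-Type: response\r\n"
            f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
            f"WARC-Date: {datetime.now(timezone.utc):%Y-%m-%dT%H:%M:%SZ}\r\n"
            f"WARC-Target-URI: {target_uri}\r\n"
            "Content-Type: application/http; msgtype=response\r\n"
            f"Content-Length: {length}\r\n"
            "\r\n"
        ).encode("utf-8")
        warc_file.write(header)

        spool.seek(0)
        shutil.copyfileobj(spool, warc_file)  # second pass: the extra disk read
        warc_file.write(b"\r\n\r\n")          # record separator per the WARC spec
```

Memory use stays at roughly one chunk regardless of payload size, at the cost of an extra write and read of every payload on disk, which is the extra disk I/O mentioned a few lines further down.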
[12:28] I believe this would lower RAM usage, especially if the WARC downloader is receiving a large file.
[12:29] I was wondering if someone else had considered this, and if they believe it's a problem that even needs solving
[12:31] I know https://github.com/internetarchive/liveweb does that. They use an output file per thread, if I'm not mistaken
[12:33] that might just be out of convenience for their thread model
[12:34] since it's sort of easier to output one file per thread than to write a message passer to handle output
[12:37] I don't think it's all that common a use case when crawling sites. But yes, big files can wreak havoc with a crawl like that... AFAIK the problem with at least `wget` is that its internal processing of the URL/location tree eats a lot of memory
[12:38] my current go-to crawler is Scrapy, and afaik it doesn't have that problem, but I've not tried it with quite as many URLs as people have put wget through
[12:38] my thought is that even a 50 or 100 MB file coming down is going to eat up 50-100 MB of memory.
[12:39] kind of negligible these days though
[12:39] well, it scales with whatever file size you're dealing with
[12:40] but, yes, it might well be a non-problem, which is what I'm wondering.
[12:40] Yeah, of course
[12:41] I'm leaning towards non-problem at this point. Though my VPS does have limited RAM.
[12:41] It would be nice to have something that can handle big/bigger files on 'less RAM' though
[12:41] so I thought it would be nice to have a low-RAM downloader.
[12:44] one drawback is that it would require extra reads/writes to disk, both to merge the WARCs and to determine the content length.
[12:47] it's also significantly more complicated to write.
[14:34] for reference: http://archive.is/lfJSs (Toyota embedded software issues)
[14:44] the trial transcript link points to a non-existent Dropbox file :|
[14:47] deathy: mirror: http://cybergibbons.com/wp-content/uploads/2013/10/Bookout_v_Toyota_Barr_REDACTED.pdf
[14:48] thanks
[16:12] I think many VPSs allow you to "burst" RAM usage. Not sure how much that helps for downloading the Internet.
[16:14] That's irrelevant though
[16:15] FWIW, wget does not buffer downloaded data in memory
[16:15] the main memory usage appears to be what ersi stated
[16:19] phillipsj: not too much -- when you've got a large wget job, you're going to be using a lot of RAM for a while
[16:19] by "large" I mean "hundreds of thousands of URLs"
[16:19] like, for a really long time.
[16:19] Especially for a large site :)
[16:19] phillipsj: it could be some other contributor; I don't think anyone here has actually profiled wget's memory behavior
[16:20] * yipdw should at some point
[16:20] I know alard has somewhat
[16:20] Since he fixed a couple of memleaks
[16:20] oh, yeah
[16:21] actually
[16:21] damn
[16:21] now I wish ArchiveBot kept max wget memory usage in its job stats
[16:22] * yipdw makes an issue
[16:25] some things just don't come up :)
[16:27] hm?
[16:28] It's not in the stats because nobody mentioned it, presumably.
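On recording wget's peak memory in job stats: a rough sketch of one way it could be captured, not how ArchiveBot actually does it. It runs wget as a child process and reads the kernel's high-water mark for waited-for children afterwards; the wget invocation is only an example, and on Linux ru_maxrss is reported in kibibytes.

```python
import resource
import subprocess

# Example invocation only; any wget job would be measured the same way.
cmd = ["wget", "--mirror", "--page-requisites", "--warc-file=example",
       "https://example.com/"]
subprocess.run(cmd, check=False)

# Peak resident set size of the largest waited-for child (KiB on Linux).
peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"peak wget RSS: {peak_kib / 1024:.1f} MiB")
```

Note that RUSAGE_CHILDREN covers every child the process has waited for, so if the job spawns more than wget the figure is the largest of those children rather than wget specifically.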
[16:30] no, I just never wrote the code to record it
[16:31] it's been known as an issue for a while, but for some reason I was like "huh, ArchiveBot has 5,000 jobs' worth of history"
[16:31] then it was like "oh fuck me"
[16:31] :P
[19:08] http://www.theverge.com/2013/11/1/5052440/youtube-live-a-disastrous-spectacle-google-would-like-you-to-forget
[19:08] choice quote: "Frattini blames a two-year licensing contract, saying the event’s videos were never meant to stay online for longer than a few years in the first place. But it turns out the conventional wisdom — that whatever you do will stay online forever — can actually be avoided when you’re the people who make the internet."
[21:03] http://64scener.com/ will shut down sometime within the next 12 months
[21:51] Hmm, my warrior is getting "no item received" pretty consistently for the blip.tv project.
[22:00] w0rp: there's nothing in the queue
[23:01] looking for Canon Canofile software... anyone have any idea where to find it?
[23:05] would be helpful for NeXT MO-related stuff
[23:21] What is NeXT MO?
[23:23] balrog: ^
[23:24] magneto-optical disc for the NeXT Computer / NeXT Cubs
[23:24] Cube*
[23:26] Do you have a NeXT computer?
[23:26] yes
[23:26] I have a cube and a slab
[23:26] woah
[23:26] that thing is ancient
[23:28] it's a giant cube