[00:14] I'd like to find a copy of the magazine containing the type-in code for http://www.worldofspectrum.org/infoseekid.cgi?id=0003101
[00:14] What happens to the warcinfo records when you concatenate two WARC files?
[00:14] (it's a German magazine called Happy Computer; this was "ZX Spectrum Sonderheft 1")
[00:16] odie5533: nothing? I assume IA can import concatenated WARCs since they import megawarcs
[00:22] ivan`: I am thinking of creating a WARC file for every record, so I wasn't sure if I should leave off warcinfo records or not.
[00:22] probably should
[00:26] Anyone here write tools for handling WARC files and know the format?
[00:29] probably the best resource: http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem
[00:29] tons of info on the format and the tools
[00:29] 2011 and 2012 episodes of the Engadget podcast are backed up
[00:29] I have the spec, I just want someone to bounce ideas off of.
[00:30] Does tef come around here much?
[00:31] odie5533: also on the page are a number of tools implementing WARC or doing something with WARC files (lots of source code as well)
[00:32] dashcloud: yeah, I know. I've written a few of them. But I still feel lonely and want someone to talk to.
[07:16] So, are we done with blip.tv? http://tracker.archiveteam.org/bloopertv/ The wiki page claims there are tenfold more users...
[08:12] we stopped to de-duplicate things, then never started again, I think.
[08:12] why not start again and try to split the work to reduce potential duplication?
[08:52] odie5533: Not so much any more, I think he's pretty busy with CodeClub
[08:52] which isn't related to archiving :) his previous work was highly archive-related ;)
[09:12] so i just found candywrappermuseum.com
[09:13] i'm mirroring it right now
[09:37] did anyone ever archive http://www.pica-pic.com/ ? it's kinda tricky since each Flash game downloads its assets separately from within the SWF file
[09:39] I hope Flash will be remembered as being a really bad idea.
[09:40] (Although it emerged from something people actually want: the ability to download and run applications without being plagued by malware all of the time.)
[10:13] Is there a recommended combination of wget options for "please give me this whole site in WARC"?
[11:55] ersi: hrm. Do you do WARC coding?
[12:15] so i need some help tracking down old TechTV stuff
[12:17] it's starting to look like Call for Help Canada may be more pirated than i thought
[12:18] i wonder if any of you guys know of private torrent sites that collect this sort of stuff
[12:18] Famicoman: could you help me with this?
[12:22] odie5533: Yes/No.
[12:24] ersi: what does that mean?
[12:25] It means exactly that.
[12:25] But... just ask what you're wondering about WARC instead of asking if someone can look at a question
[12:26] well, I was thinking about how WARC writing works, and it probably uses a lot of RAM because it seems to store an entire record in memory before writing it out, so it can compute the Content-Length for the WARC record header.
[12:26] Additionally, if you are downloading a website, only one record can be written at a time since it writes to a single file.
[12:27] Both of these seem to me to make it difficult to download large files/websites using a single WARC file as output, so I was considering using multiple WARC files as output.
[12:27] And writing to the file as data is received, then going back later to fill in the length of the record.
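A minimal sketch of that approach, assuming Python: stream the payload to a temporary spool file and only write the WARC record header once the length is known. The write_response_record helper and the header fields shown are illustrative, not code from any existing tool; a real writer would also emit request and warcinfo records, payload digests, and per-record gzip.

```python
import shutil
import tempfile
import uuid
from datetime import datetime, timezone


def write_response_record(warc_file, target_uri, body_chunks):
    """Write one WARC response record without holding the payload in RAM.

    body_chunks is any iterable of bytes, e.g. a streaming HTTP response.
    """
    with tempfile.TemporaryFile() as spool:
        length = 0
        for chunk in body_chunks:            # stream to disk as data arrives
            spool.write(chunk)
            length += len(chunk)

        # Only now is the Content-Length known, so the header can be written.
        header = (
            "WARC/1.0\r\n"
            "WARC-Type: response\r\n"
            f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
            f"WARC-Date: {datetime.now(timezone.utc):%Y-%m-%dT%H:%M:%SZ}\r\n"
            f"WARC-Target-URI: {target_uri}\r\n"
            "Content-Type: application/http; msgtype=response\r\n"
            f"Content-Length: {length}\r\n"
            "\r\n"
        ).encode("utf-8")
        warc_file.write(header)

        spool.seek(0)
        shutil.copyfileobj(spool, warc_file)  # second pass: the extra disk read
        warc_file.write(b"\r\n\r\n")          # record separator per the WARC spec
```

Memory use stays at roughly one chunk regardless of payload size, at the cost of an extra write and read of every payload on disk, which is the extra disk I/O mentioned a few lines further down.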
[12:28] I believe this would lower RAM usage, especially if the WARC downloader is receiving a large file.
[12:29] I was wondering if someone else had considered this, and if they believe it's a problem that even needs solving
[12:31] I know https://github.com/internetarchive/liveweb does that. They use an output file per thread, if I'm not mistaken
[12:33] that might just be out of convenience for their thread model
[12:34] since it's sort of easier to output one file per thread than to write a message passer to handle output
[12:37] I don't think it's all that common a use case when crawling sites. But yes, big files can wreak havoc with a crawl like that... AFAIK the problem with at least `wget` is that its internal processing of the URL/location tree eats a lot of memory
[12:38] my current go-to crawler is Scrapy, and afaik it doesn't have that problem, but I've not tried it with quite as many URLs as people have put wget through
[12:38] my thought is that even a 50 or 100 MB file coming down is going to eat up 50-100 MB of memory.
[12:39] kind of negligible these days though
[12:39] well, it scales with whatever file size you're dealing with
[12:40] but, yes, it might well be a non-problem, which is what I'm wondering.
[12:40] Yeah, of course
[12:41] I'm leaning towards non-problem at this point. Though my VPS does have limited RAM.
[12:41] It would be nice to have something that can handle big/bigger files on 'less RAM' though
[12:41] so I thought it would be nice to have a low-RAM downloader.
[12:44] one drawback is that it would require extra reads/writes to disk, both to merge the WARCs and to determine the content length.
[12:47] it's also significantly more complicated to write.
[14:34] for reference: http://archive.is/lfJSs (Toyota embedded software issues)
[14:44] the trial transcript link points to a non-existent Dropbox file :|
[14:47] deathy: mirror: http://cybergibbons.com/wp-content/uploads/2013/10/Bookout_v_Toyota_Barr_REDACTED.pdf
[14:48] thanks
[16:12] I think many VPSs allow you to "burst" RAM usage. Not sure how much that helps for downloading the Internet.
[16:14] That's irrelevant though
[16:15] FWIW, wget does not buffer downloaded data in memory
[16:15] the main memory usage appears to be what ersi stated
[16:19] phillipsj: not too much -- when you've got a large wget job, you're going to be using a lot of RAM for a while
[16:19] by "large" I mean "hundreds of thousands of URLs"
[16:19] like, for a really long time.
[16:19] Especially for a large site :)
[16:19] phillipsj: it could be some other contributor; I don't think anyone here has actually profiled wget's memory behavior
[16:20] * yipdw should at some point
[16:20] I know alard has somewhat
[16:20] Since he fixed a couple of memleaks
[16:20] oh, yeah
[16:21] actually
[16:21] damn
[16:21] now I wish ArchiveBot kept max wget memory usage in its job stats
[16:22] * yipdw makes an issue
[16:25] some things just don't come up :)
[16:27] hm?
[16:28] It's not in the stats because nobody mentioned it, presumably.
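On recording wget's peak memory in job stats: a rough sketch of one way it could be captured, not how ArchiveBot actually does it. It runs wget as a child process and reads the kernel's high-water mark for waited-for children afterwards; the wget invocation is only an example, and on Linux ru_maxrss is reported in kibibytes.

```python
import resource
import subprocess

# Example invocation only; any wget job would be measured the same way.
cmd = ["wget", "--mirror", "--page-requisites", "--warc-file=example",
       "https://example.com/"]
subprocess.run(cmd, check=False)

# Peak resident set size of the largest waited-for child (KiB on Linux).
peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"peak wget RSS: {peak_kib / 1024:.1f} MiB")
```

Note that RUSAGE_CHILDREN covers every child the process has waited for, so if the job spawns more than wget the figure is the largest of those children rather than wget specifically.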
[16:30] no, I just never wrote the code to record it
[16:31] it's been known as an issue for a while, but for some reason I was like "huh, ArchiveBot has 5,000 jobs' worth of history"
[16:31] then it was like "oh fuck me"
[16:31] :P
[19:08] http://www.theverge.com/2013/11/1/5052440/youtube-live-a-disastrous-spectacle-google-would-like-you-to-forget
[19:08] choice quote: "Frattini blames a two-year licensing contract, saying the event’s videos were never meant to stay online for longer than a few years in the first place. But it turns out the conventional wisdom — that whatever you do will stay online forever — can actually be avoided when you’re the people who make the internet."
[21:03] http://64scener.com/ will shut down sometime within the next 12 months
[21:51] Hmm, my warrior is getting "no item received" pretty consistently for the blip.tv project.
[22:00] w0rp: there's nothing in the queue
[23:01] looking for Canon Canofile software... anyone have any idea where to find it?
[23:05] would be helpful for NeXT MO-related stuff
[23:21] What is NeXT MO?
[23:23] balrog: ^
[23:24] magneto-optical disc for the NeXT Computer / NeXT Cubs
[23:24] Cube*
[23:26] Do you have a NeXT computer?
[23:26] yes
[23:26] I have a cube and a slab
[23:26] woah
[23:26] that thing is ancient
[23:28] it's a giant cube