#archiveteam 2013-11-01,Fri


Time Nickname Message
00:14 πŸ”— balrog I'd like to find a copy of the magazine containing the type-in code for http://www.worldofspectrum.org/infoseekid.cgi?id=0003101
00:14 πŸ”— odie5533 What happens to the Warc info records when you concatenate two warc files?
00:14 πŸ”— balrog (it's a German magazine called Happy Computer, this was "ZX Spectrum Sonderheft 1")
00:16 πŸ”— ivan` odie5533: nothing? I assume IA can import concatenated warcs since they import megawarcs
00:22 πŸ”— odie5533 ivan`: I am thinking of creating a warc file for every record, so I wasn't sure if I should leave off warc info records or not.
00:22 πŸ”— odie5533 probably should
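For context on the concatenation and warcinfo questions above: records in a .warc.gz are conventionally compressed one gzip member per record, so two such files can be concatenated byte-for-byte into a valid WARC, and each constituent file can keep its own warcinfo record (a warcinfo record is read as describing the records that follow it). A minimal Python sketch, with illustrative file names:

    # Each record in a .warc.gz is its own gzip member, so plain byte
    # concatenation of per-record-compressed files yields a valid combined WARC.
    with open("combined.warc.gz", "wb") as out:
        for part in ("a.warc.gz", "b.warc.gz"):
            with open(part, "rb") as src:
                out.write(src.read())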
00:26 πŸ”— odie5533 Anyone here write tools for handling warc files and know the format?
00:29 πŸ”— dashcloud probably the best resource: http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem
00:29 πŸ”— dashcloud tons of info on the format and the tools
00:29 πŸ”— godane 2011 and 2012 episodes of the Engadget podcast are backed up
00:29 πŸ”— odie5533 I have the spec, I'm just wanting someone to bounce ideas off of.
00:30 πŸ”— odie5533 Does tef come around here much?
00:31 πŸ”— dashcloud odie5533: also on the page are a number of tools implementing WARC or doing something to WARC files (lots of source code as well)
00:32 πŸ”— odie5533 dashcloud: yeah, I know. I've written a few of them. But I still feel lonely and want someone to talk to.
07:16 πŸ”— Nemo_bis So, are we done with blip.tv? http://tracker.archiveteam.org/bloopertv/ The wiki page claims there are tenfold more users...
08:12 πŸ”— Cameron_D we stopped to de-duplicate things, then never started again, I think.
08:12 πŸ”— Lord_Nigh why not start again and try to split the work to reduce potential duplication?
08:52 πŸ”— ersi odie5533: Not so much any more, I think he's pretty busy with CodeClub
08:52 πŸ”— ersi which isn't related to archiving :) his ex-work was highly archive related ;)
09:12 πŸ”— godane so i just found candywrappermuseum.com
09:13 πŸ”— godane i'm mirroring it right now
09:37 πŸ”— Lord_Nigh did anyone ever archive http://www.pica-pic.com/ ? it's kinda tricky since each flash game downloads its assets separately from within the swf file
09:39 πŸ”— w0rp I hope Flash will be remembered as being a really bad idea.
09:40 πŸ”— w0rp (Although it emerged from something people actually want: the ability to download and run applications without being plagued by malware all of the time.)
10:13 πŸ”— w0rp Is there a recommended combination of options for, "Please give me this whole site in WARC" for wget?
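A commonly used combination for this is wget's WARC options (available in wget 1.14 and later). A hedged sketch, expressed here as a Python subprocess call with a placeholder URL and output prefix:

    import subprocess

    # Recursive mirror with page requisites, writing example-site.warc.gz plus a
    # CDX index alongside the downloaded files; URL and prefix are placeholders.
    subprocess.run([
        "wget",
        "--mirror",                  # recursive download with timestamping
        "--page-requisites",         # also fetch CSS/JS/images pages need to render
        "--warc-file=example-site",  # WARC output prefix (.warc.gz is appended)
        "--warc-cdx",                # write a CDX index for the WARC
        "-e", "robots=off",          # archive crawls typically ignore robots.txt
        "--wait=1",                  # pause between requests
        "http://example.com/",
    ], check=True)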
11:55 πŸ”— odie5533 ersi: hrm. Do you do warc coding?
12:15 πŸ”— godane so i need some help tracking down old techtv stuff
12:17 πŸ”— godane it's starting to look like call for help canada may be more pirated than i thought
12:18 πŸ”— godane i wonder if any of you guys know of private torrent sites that collect this sort of stuff
12:18 πŸ”— godane Famicoman: could you help me with this?
12:22 πŸ”— ersi odie5533: Yes/No.
12:24 πŸ”— odie5533 ersi: what does that mean?
12:25 πŸ”— ersi It means exactly that.
12:25 πŸ”— ersi But.. Just ask what you're wondering about WARC instead of asking if someone can look at a question
12:26 πŸ”— odie5533 well I was thinking about how warc writing works, and I was thinking it probably uses a lot of RAM because it seems to store an entire record in memory before writing it out, so it can determine the Content-Length for the WARC record header.
12:26 πŸ”— odie5533 Additionally, if you are downloading a website, only one record can be written at a time since it writes to a single file.
12:27 πŸ”— odie5533 Both of these seem to me to make it difficult to download large files/websites using a single warc file as output, so I was considering using multiple warc files as output.
12:27 πŸ”— odie5533 And outputting to the file as data is received, then going back later to determine the length of the record.
12:28 πŸ”— odie5533 I believe this would lower RAM usage, especially if the WARC downloader is receiving a large file.
12:29 πŸ”— odie5533 I was wondering if someone else had considered this, and if they believe it was a problem that even needed solving
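One way to get the low-memory behaviour described above, written as a rough Python illustration rather than any existing tool's code: spool the response payload to a temporary file while it streams in, which yields the Content-Length for the record header without either holding the whole body in RAM or seeking back to patch the header. The helper name and its arguments are hypothetical.

    import shutil
    import tempfile
    import uuid
    from datetime import datetime, timezone

    def write_response_record(warc_out, target_uri, payload_stream, chunk_size=64 * 1024):
        # Spool the HTTP payload to disk in chunks so memory use stays bounded
        # regardless of how large the downloaded file is.
        with tempfile.TemporaryFile() as spool:
            length = 0
            while True:
                chunk = payload_stream.read(chunk_size)
                if not chunk:
                    break
                spool.write(chunk)
                length += len(chunk)

            # With the block length known, the WARC record headers can be written
            # up front, followed by a straight copy of the spooled payload.
            headers = (
                "WARC/1.0\r\n"
                "WARC-Type: response\r\n"
                f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
                f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}\r\n"
                f"WARC-Target-URI: {target_uri}\r\n"
                "Content-Type: application/http; msgtype=response\r\n"
                f"Content-Length: {length}\r\n"
                "\r\n"
            )
            warc_out.write(headers.encode("utf-8"))
            spool.seek(0)
            shutil.copyfileobj(spool, warc_out)
            warc_out.write(b"\r\n\r\n")  # two CRLFs terminate a WARC record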
12:31 πŸ”— ersi I know https://github.com/internetarchive/liveweb does that. They use an output file per thread, if I'm not mistaken
12:33 πŸ”— odie5533 that might just be out of convenience for their thread model
12:34 πŸ”— odie5533 since it's sort of easier to output 1 file per thread rather than writing a message passer to handle output
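The one-file-per-thread pattern mentioned here can be sketched roughly as below (a hypothetical helper, not liveweb's actual code): each worker thread lazily opens its own WARC output file, so records from different threads never interleave and no message-passing writer is needed.

    import threading

    _thread_local = threading.local()

    def get_warc_output(prefix="crawl"):
        # Hypothetical helper: each thread gets (and reuses) its own output file,
        # named after the thread, so no locking is needed around record writes.
        if not hasattr(_thread_local, "warc_out"):
            name = "{}-{}.warc.gz".format(prefix, threading.current_thread().name)
            _thread_local.warc_out = open(name, "wb")
        return _thread_local.warc_out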
12:37 πŸ”— ersi I don't think it's an all that common use case when crawling sites. But yes, big files can wreak havoc with a crawl like that.. AFAIK the problem with at least `wget` is that its internal processing of URLs/location tree eats a lot of memory
12:38 πŸ”— odie5533 my current go-to crawler is Scrapy, and afaik it doesn't have that problem, but I've not tried it with quite as many urls as people have put wget through
12:38 πŸ”— odie5533 my thought is that even a 50 or 100 MB file coming down is going to then eat up 50 - 100 MB of memory.
12:39 πŸ”— ersi kind of negligible these days though
12:39 πŸ”— odie5533 well, it scales with whatever file size you name
12:40 πŸ”— odie5533 but, yes, it might well be a non-problem which is what I'm wondering.
12:40 πŸ”— ersi Yeah, of course
12:41 πŸ”— odie5533 I'm leaning towards non-problem at this point. Though my VPS does have limited RAM.
12:41 πŸ”— ersi It would be nice to have something that can handle big/bigger files on 'less RAM' though
12:41 πŸ”— odie5533 so I thought it would be nice to have a low-RAM downloader.
12:44 πŸ”— odie5533 one drawback is it would require extra read/writes to the disk, both to merge the warcs and to determine the content length.
12:47 πŸ”— odie5533 it's also significantly more complicated to write.
14:34 πŸ”— balrog for reference: http://archive.is/lfJSs (Toyota embedded software issues)
14:44 πŸ”— deathy trial transcript link sends to non-existing dropbox file :|
14:47 πŸ”— balrog deathy: mirror: http://cybergibbons.com/wp-content/uploads/2013/10/Bookout_v_Toyota_Barr_REDACTED.pdf
14:48 πŸ”— deathy thanks
16:12 πŸ”— phillipsj I think many VPSs allow you to "burst" RAM usage. Not sure how much that helps for downloading the Internet.
16:14 πŸ”— ersi That's irrelevant though
16:15 πŸ”— yipdw FWIW, wget does not buffer downloaded data in memory
16:15 πŸ”— yipdw the main memory usage appears to be what ersi stated
16:19 πŸ”— yipdw phillipsj: not too much -- when you've got a large wget job, you're going to be using a lot of RAM for a while
16:19 πŸ”— yipdw by "large" I mean "hundreds of thousands of URLs"
16:19 πŸ”— ersi like, for a really long time.
16:19 πŸ”— ersi Especially for a large site :)
16:19 πŸ”— yipdw phillipsj: it could be some other contributor; I don't think anyone here has actually profiled wget's memory behavior
16:20 πŸ”— * yipdw should at some point
16:20 πŸ”— ersi I know alard has somewhat
16:20 πŸ”— ersi Since he fixed a couple of memleaks
16:20 πŸ”— yipdw oh, yeah
16:21 πŸ”— yipdw actually
16:21 πŸ”— yipdw damn
16:21 πŸ”— yipdw now I wish ArchiveBot kept max wget memory usage in its job stats
16:22 πŸ”— * yipdw makes an issue
16:25 πŸ”— phillipsj some things just don't come up :)
16:27 πŸ”— ersi hm?
16:28 πŸ”— phillipsj It's not in the stats because nobody mentioned it, presumably.
16:30 πŸ”— yipdw no, I just never wrote the code to record it
16:31 πŸ”— yipdw it's been known as an issue for a while but for some reason I was like "huh, ArchiveBot has 5,000 jobs worth of history"
16:31 πŸ”— yipdw then it was like "oh fuck me"
16:31 πŸ”— yipdw :P
19:08 πŸ”— lemonkey http://www.theverge.com/2013/11/1/5052440/youtube-live-a-disastrous-spectacle-google-would-like-you-to-forget
19:08 πŸ”— lemonkey choice quote: "Frattini blames a two-year licensing contract, saying the event's videos were never meant to stay online for longer than a few years in the first place. But it turns out the conventional wisdom — that whatever you do will stay online forever — can actually be avoided when you're the people who make the internet."
21:03 πŸ”— Lord_Nigh http://64scener.com/ will shut down sometime within the next 12 months
21:51 πŸ”— w0rp Hmm, my warrior is getting "no item received" pretty consistently for the blip.tv project.
22:00 πŸ”— yipdw w0rp: there's nothing in the queue
23:01 πŸ”— balrog looking for Canon Canofile software... anyone have any idea where to find it?
23:05 πŸ”— balrog would be helpful for NeXT MO related stuff
23:21 πŸ”— odie5533 What is Next mo?
23:23 πŸ”— odie5533 balrog: ^
23:24 πŸ”— balrog magneto-optical disc for the NeXT Computer / NeXT Cubs
23:24 πŸ”— balrog Cube*
23:26 πŸ”— odie5533 Do you have a NeXT computer?
23:26 πŸ”— balrog yes
23:26 πŸ”— balrog I have a cube and a slab
23:26 πŸ”— odie5533 woah
23:26 πŸ”— odie5533 that thing is ancient
23:28 πŸ”— odie5533 it's a giant cube
