Time | Nickname | Message
00:14 | balrog | I'd like to find a copy of the magazine containing the type-in code for http://www.worldofspectrum.org/infoseekid.cgi?id=0003101
00:14 | odie5533 | What happens to the Warc info records when you concatenate two warc files?
00:14 | balrog | (it's a German magazine called Happy Computer, this was "ZX Spectrum Sonderheft 1")
00:16 | ivan` | odie5533: nothing? I assume IA can import concatenated warcs since they import megawarcs
00:22 | odie5533 | ivan`: I am thinking of creating a warc file for every record, so I wasn't sure if I should leave off warc info records or not.
00:22 | odie5533 | probably should
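
An aside on the warcinfo question: per the WARC spec, a file is just a sequence of self-delimiting records, so concatenating two valid WARC files yields a valid file that simply contains both warcinfo records, and each warcinfo record describes the records that follow it up to the next warcinfo record (or end of file). A minimal sketch of writing one warcinfo record per output file, assuming hand-rolled WARC/1.0 framing rather than any particular library (the helper name is illustrative):

    import uuid
    from datetime import datetime, timezone

    def warcinfo_record(fields: bytes) -> bytes:
        """Build a minimal warcinfo record; `fields` is application/warc-fields text."""
        body = fields + b"\r\n"
        header = (
            b"WARC/1.0\r\n"
            b"WARC-Type: warcinfo\r\n"
            + "WARC-Record-ID: <urn:uuid:{}>\r\n".format(uuid.uuid4()).encode()
            + "WARC-Date: {}\r\n".format(
                datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")).encode()
            + b"Content-Type: application/warc-fields\r\n"
            + "Content-Length: {}\r\n".format(len(body)).encode()
        )
        # Record framing: header block, blank line, content block, two CRLFs.
        return header + b"\r\n" + body + b"\r\n\r\n"

    # e.g. warcinfo_record(b"software: my-crawler/0.1\r\nformat: WARC File Format 1.0")
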
00:26 | odie5533 | Anyone here write tools for handling warc files and know the format?
00:29 | dashcloud | probably the best resource: http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem
00:29 | dashcloud | tons of info on the format and the tools
00:29 | godane | 2011 and 2012 episodes of the Engadget podcast are backed up
00:29 | odie5533 | I have the spec, I'm just wanting someone to bounce ideas off of.
00:30 | odie5533 | Does tef come around here much?
00:31 | dashcloud | odie5533: also on the page are a number of tools implementing WARC or doing something to WARC files (lots of source code as well)
00:32 | odie5533 | dashcloud: yeah, I know. I've written a few of them. But I still feel lonely and want someone to talk to.
07:16 | Nemo_bis | So, are we done with blip.tv? http://tracker.archiveteam.org/bloopertv/ The wiki page claims there are tenfold more users...
08:12 | Cameron_D | we stopped to de-duplicate things, then never started again, I think.
08:12 | Lord_Nigh | why not start again and try to split the work to reduce potential duplication?
08:52 | ersi | odie5533: Not so much any more, I think he's pretty busy with CodeClub
08:52 | ersi | which isn't related to archiving :) his ex-work was highly archive related ;)
09:12 | godane | so i just found candywrappermuseum.com
09:13 | godane | i'm mirroring it right now
09:37 | Lord_Nigh | did anyone ever archive http://www.pica-pic.com/ ? it's kinda tricky since each flash game downloads its assets separately from within the swf file
09:39 | w0rp | I hope Flash will be remembered as being a really bad idea.
09:40 | w0rp | (Although it emerged from what people actually want: the ability to download and run applications without being plagued by malware all of the time.)
10:13 | w0rp | Is there a recommended combination of options for "Please give me this whole site in WARC" for wget?
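
For what it's worth, a combination along these lines is a reasonable starting point (the output name and URL are placeholders, the --warc-* options need wget 1.14 or newer, and ignoring robots.txt plus the wait settings are judgment calls):

    wget --mirror --page-requisites --no-parent \
         --warc-file=example-site --warc-cdx \
         -e robots=off --wait 1 --random-wait \
         "http://www.example.com/"

wget also has a --warc-max-size option for splitting the output into several WARCs of bounded size, which touches on the large-file discussion further down.
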
11:55 | odie5533 | ersi: hrm. Do you do warc coding?
12:15 | godane | so i need some help tracking down old techtv stuff
12:17 | godane | it's starting to look like Call for Help Canada may be more pirated than i thought
12:18 | godane | i wonder if any of you guys know of private torrent sites that collect this sort of stuff
12:18 | godane | Famicoman: could you help me with this?
12:22 | ersi | odie5533: Yes/No.
12:24 | odie5533 | ersi: what does that mean?
12:25 | ersi | It means exactly that.
12:25 | ersi | But.. Just ask what you're wondering about WARC instead of asking if someone can look at a question
12:26 | odie5533 | well I was thinking about how warc writing works, and it probably uses a lot of RAM because it seems to store an entire record in memory before writing it out, so it can compute the Content-Length for the WARC record header.
12:26 | odie5533 | Additionally, if you are downloading a website, only one record can be written at a time since it writes to a single file.
12:27 | odie5533 | Both of these seem to me to make it difficult to download large files/websites using a single warc file as output, so I was considering using multiple warc files as output.
12:27 | odie5533 | And to output to the file as data is received, then go back later to determine the length of the record.
12:28 | odie5533 | I believe this would lower RAM usage, especially if the WARC downloader is receiving a large file.
12:29 | odie5533 | I was wondering if someone else had considered this, and if they believe it was a problem that even needed solving
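
A rough sketch of how that streaming idea could look (the names are illustrative, not an existing tool): spool the payload to a temporary file as it arrives, so only one small buffer sits in memory, then emit the record once the length is known.

    import shutil, tempfile

    def write_record_streaming(warc_out, record_headers, payload_iter, bufsize=64 * 1024):
        """record_headers: the WARC header block without the Content-Length line,
        ending with CRLF. payload_iter: yields chunks, e.g. from an HTTP response."""
        with tempfile.TemporaryFile() as spool:
            total = 0
            for chunk in payload_iter:           # constant memory per chunk
                spool.write(chunk)
                total += len(chunk)
            spool.seek(0)
            warc_out.write(record_headers)
            warc_out.write("Content-Length: {}\r\n\r\n".format(total).encode())
            shutil.copyfileobj(spool, warc_out, bufsize)   # stream payload back out
            warc_out.write(b"\r\n\r\n")          # record terminator

The trade-off is the one raised a few lines later: every payload gets written to disk twice.
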
12:31 | ersi | I know https://github.com/internetarchive/liveweb does that. They use an output file per thread, if I'm not mistaken
12:33 | odie5533 | that might just be out of convenience for their thread model
12:34 | odie5533 | since it's sort of easier to output 1 file per thread rather than writing a message passer to handle output
12:37 | ersi | I don't think it's an all that common use case - when crawling sites. But yes, big files can wreak havoc with a crawl like that.. AFAIK the problem with at least `wget` is that its internal processing of URLs/location tree eats a lot of memory
12:38 | odie5533 | my current go-to crawler is Scrapy, and afaik it doesn't have that problem, but I've not tried it with quite as many urls as people have put wget through
12:38 | odie5533 | my thought is that even a 50 or 100 MB file coming down is going to then eat up 50 - 100 MB of memory.
12:39 | ersi | kind of negligible these days though
12:39
π
|
odie5533 |
well, it scales with whatever size file you say |
12:40
π
|
odie5533 |
but, yes, it might well be a non-problem which is what I'm wondering. |
12:40
π
|
ersi |
Yeah, of course |
12:41
π
|
odie5533 |
I'm leaning towards non-problem at this point. Though my VPS does have limited RAM. |
12:41
π
|
ersi |
It would be nice to have something that can handle big/bigger files on 'less RAM' though |
12:41
π
|
odie5533 |
so I thought it would be nice to have a low-RAM downloader. |
12:44
π
|
odie5533 |
one drawback is it would require extra read/writes to the disk, both to merge the warcs and to determine the content length. |
12:47 | odie5533 | it's also significantly more complicated to write.
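
On the merge step: if every record goes into its own gzipped WARC, the merge can be plain byte concatenation, since a sequence of gzip members is itself a valid gzip stream and the usual convention is one WARC record per member. That is roughly how megawarc-style packing works, as far as I understand it. A sketch, with placeholder file names:

    import glob, shutil

    with open("merged.warc.gz", "wb") as out:
        for name in sorted(glob.glob("records/*.warc.gz")):
            with open(name, "rb") as part:
                shutil.copyfileobj(part, out)
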
14:34 | balrog | for reference: http://archive.is/lfJSs (Toyota embedded software issues)
14:44 | deathy | trial transcript link sends to non-existing dropbox file :|
14:47 | balrog | deathy: mirror: http://cybergibbons.com/wp-content/uploads/2013/10/Bookout_v_Toyota_Barr_REDACTED.pdf
14:48 | deathy | thanks
16:12 | phillipsj | I think many VPSs allow you to "burst" RAM usage. Not sure how much that helps for downloading the Internet.
16:14 | ersi | That's irrelevant though
16:15 | yipdw | FWIW, wget does not buffer downloaded data in memory
16:15 | yipdw | the main memory usage appears to be what ersi stated
16:19 | yipdw | phillipsj: not too much -- when you've got a large wget job, you're going to be using a lot of RAM for a while
16:19 | yipdw | by "large" I mean "hundreds of thousands of URLs"
16:19 | ersi | like, for a really long time.
16:19 | ersi | Especially for a large site :)
16:19 | yipdw | phillipsj: it could be some other contributor; I don't think anyone here has actually profiled wget's memory behavior
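
Two quick ways to take a first look, for whoever gets around to profiling it (the URL is a placeholder): GNU time reports the peak resident set size, and valgrind's massif tool gives a heap profile.

    # GNU time: look for "Maximum resident set size" in the report
    /usr/bin/time -v wget --mirror "http://www.example.com/"

    # Heap profile: inspect the resulting massif.out.* file with ms_print
    valgrind --tool=massif wget --mirror "http://www.example.com/"
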
16:20 | * | yipdw should at some point
16:20 | ersi | I know alard has somewhat
16:20 | ersi | Since he fixed a couple of memleaks
16:20 | yipdw | oh, yeah
16:21 | yipdw | actually
16:21 | yipdw | damn
16:21 | yipdw | now I wish ArchiveBot kept max wget memory usage in its job stats
16:22 | * | yipdw makes an issue
16:25 | phillipsj | some things just don't come up :)
16:27 | ersi | hm?
16:28 | phillipsj | It's not in the stats because nobody mentioned it, presumably.
16:30 | yipdw | no, I just never wrote the code to record it
16:31 | yipdw | it's been known as an issue for a while but for some reason I was like "huh, ArchiveBot has 5,000 jobs worth of history"
16:31 | yipdw | then it was like "oh fuck me"
16:31 | yipdw | :P
19:08 | lemonkey | http://www.theverge.com/2013/11/1/5052440/youtube-live-a-disastrous-spectacle-google-would-like-you-to-forget
19:08 | lemonkey | choice quote: "Frattini blames a two-year licensing contract, saying the event's videos were never meant to stay online for longer than a few years in the first place. But it turns out the conventional wisdom — that whatever you do will stay online forever — can actually be avoided when you're the people who make the internet."
21:03 | Lord_Nigh | http://64scener.com/ will shut down sometime within the next 12 months
21:51 | w0rp | Hmm, my warrior is getting "no item received" pretty consistently for the blip.tv project.
22:00 | yipdw | w0rp: there's nothing in the queue
23:01 | balrog | looking for Canon Canofile software... anyone have any idea where to find it?
23:05 | balrog | would be helpful for NeXT MO related stuff
23:21 | odie5533 | What is Next mo?
23:23 | odie5533 | balrog: ^
23:24 | balrog | magneto-optical disc for the NeXT Computer / NeXT Cubs
23:24 | balrog | Cube*
23:26 | odie5533 | Do you have a NeXT computer?
23:26 | balrog | yes
23:26 | balrog | I have a cube and a slab
23:26 | odie5533 | woah
23:26 | odie5533 | that thing is ancient
23:28 | odie5533 | it's a giant cube