[00:00] will do, thanks
[00:01] one other thing: I don't have space for such a large file atm, but if you have a smaller file (couple of gigs max) I can test some stuff
[00:02] I'm working on these: https://archive.org/details/archiveteam_upcoming
[00:02] they're all pretty big though, 25GB each
[00:07] yeah that's a bit big for now :/
[00:08] I can use the `Megawarc` utility to restore the megawarc.warc to a TAR of .warc files, then use warctozip to convert them to ZIPs full of files, and then extract those
[00:09] I just wonder if there's something more efficient
[00:09] afaict warcat should be able to extract megawarcs
[00:09] seems like something that would've come up, trying to reconstruct an Archive Team save into a static archive
[00:09] I can't really try it from here since I'm currently starved for storage :/
[00:10] I'll give it a shot, thanks
[00:13] waxpancak: good luck
[00:13] keep us updated :)
[00:13] have to switch to a server where I can install python3 to try it
[00:13] but that's no big deal, crossing my fingers
[00:17] * nico has built python3 into his ~archivebot directory
[00:17] debian's version is too old for yipdw's code :)
[00:18] I should mandate an OS version
[00:18] it seems that even with Python's abstractions that stuff still comes back to do weird shit
[00:19] let's make a coreos image !
[00:20] when archivebot was using wget-lua, i had some strange bug seen only by me because the vps was x86_64
[00:20] urgh, javascript is weird
[00:21] wpull is a real improvement
[00:21] danneh_: javascript is php on the client side
[00:22] nico: sounds like undefined C behavious
[00:22] behaviour*
[00:23] which should be fixed :/
[00:23] good luck fixing the wget code
[00:24] nico: I've fixed wget code before
[00:25] less obscure possibility: you run out of memory faster in 64-bit mode
[00:25] anyway off-topic siren
[00:25] without the kernel complaining?
[00:30] Alright then, downloading a bunch of these json files: http://h18000.www1.hp.com/cpq-products/quickspecs/soc/soc_80798.json numbered from 80001 to 80798 apparently
[00:30] After that I'll scrape those for PDF links and grab those too
[00:31] danneh_: good job :D
[00:31] balrog: thanks :D
[00:31] tip: don't worry about the JS at all, just open up Chrome's Networking panel and see what it requests from there instead
[00:33] look like soc_80000.json is the men
[00:33] menu
[00:36] hmm, there's also stuff like: http://h18000.www1.hp.com/cpq-products/quickspecs/14907_div/14907_div.json and http://h18000.www1.hp.com/cpq-products/quickspecs/division/10991_div.json scattered around
[00:36] Lots of different places for things
[00:39] oh man, warcat is THE BEST
[00:40] it's tearing through a 25GB megawarc right now. This makes the process *so much simpler*
[00:40] ./opt/python3/bin/python3 -m warcat extract upcoming_20130425064232.megawarc.warc.gz --output-dir ./tmp/ --progress
[00:42] didn't someone previously grab HP's entire FTP and upload it to IA? Couldn't you just break out all of the PDFs from that dump if it exists? (this is assuming the destination for the PDFs is actually on HP's FTP)
[00:42] And it's splitting it by hostname, effectively reconstructing the site (and the assets it linked to) as they were before it died. This is exactly what I wanted.
[00:43] ;-)
[00:45] waxpancak: would you mind writing up the process from start to finish so it can be placed on the wiki for others who might like to do the same thing?
[00:46] Happy to.
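
For the HP QuickSpecs grab described at 00:30 (fetch soc_80001.json through soc_80798.json, then scrape them for PDF links), something along these lines would do it. This is a minimal sketch, not danneh_'s actual script: the filename pattern, output directory, and the PDF regex are assumptions.

```python
# Hedged sketch: download the numbered QuickSpecs JSON files and collect
# any PDF URLs they mention. Paths and the regex are assumptions.
import pathlib
import re
import urllib.request

BASE = "http://h18000.www1.hp.com/cpq-products/quickspecs/soc/soc_{}.json"
OUT = pathlib.Path("quickspecs")
OUT.mkdir(exist_ok=True)

pdf_links = set()
for n in range(80001, 80799):               # 80001..80798 inclusive
    url = BASE.format(n)
    try:
        body = urllib.request.urlopen(url, timeout=30).read()
    except Exception as exc:                # gaps / 404s are expected
        print(f"skip {url}: {exc}")
        continue
    (OUT / f"soc_{n}.json").write_bytes(body)
    # crude scrape: anything quoted that ends in .pdf
    pdf_links.update(re.findall(rb'["\'](\S+?\.pdf)["\']', body, re.I))

(OUT / "pdf_links.txt").write_bytes(b"\n".join(sorted(pdf_links)) + b"\n")
```

The resulting pdf_links.txt could then be handed to wget or wpull for the actual PDF grab.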
[00:46] I don't know of anyone else who's actually done something like this with a WARC before
[00:46] I've extracted WARCs using the unarchiver and it produces basically that sort of output.
[00:46] but those were small warcs that I created, usually.
[00:47] or mobileme warcs
[00:47] I acquired the upcoming.org domain back from Yahoo and I'm trying to put the historical archives back at their original URLs
[00:47] dashcloud: from experience with finding ex-DEC stuff on HP's site, I wouldn't want to bet on it being in the ftp.hp.com dump
[00:47] waxpancak: I'm aware. Good work with that, btw! :)
[00:48] I'm assuming those archives will just be static, correct?
[00:48] yeah, just a historical archive with a minimalist design
[00:49] I can't thank all of you enough, it was pretty amazing to watch the grab when it happened a year ago
[00:49] i'm starting to upload season 14 of the joy of painting
[01:00] thank you for being a generally awesome person
[01:20] Just to keep up, here's what I'm currently archiving, will add to it as I go find more links (and if you guys find more, ping me!): http://pastebin.com/g3v3fDhh
[01:20] (FWIW http://wakaba.c3.cx/s/apps/unarchiver has WARC support, since I insisted ;) )
[01:20] DFJustin: any comments on that?
[01:20] haha sweet, I was gonna bug him at some point
[01:20] ah :P
[01:25] hmm it only seems to work with .warc and not .warc.gz though
[01:27] DFJustin: it should un-gzip the .gz first
[01:30] well like doing `lsar blahblah.warc.gz` just shows blahblah.warc
[01:30] which is inconvenient
[02:45] Warcat is saving everything exactly how I need, but this is going to take *forever*. It's been cranking on a single 25GB .megawarc (one of 142 saved) for the last two hours, but it's only extracted 1.3G of uncompressed files so far
[02:46] chfoo: Any advice?
[06:43] so i'm starting to grab more 60 minutes rewind episodes
[06:56] Shhh, no hugging, archive team is supposed to be mean
[06:56] Greets from Copenhagen, soon Sweden
[07:02] i just got the cbs state of the union webcast from 2011
[07:02] its the 'after show' report of the state of the union
[07:02] that was only online i think
[07:04] season 16 of the joy of painting is going up
[07:06] Godane, there's no way the Joy of Painting will survive.
[07:07] ok
[07:07] i thought it would
[07:08] the guy is sort of dead
[07:19] https://www.bobross.com/
[07:19] A very active, very profitable, very involved company that still sells the shows.
[07:19] SketchCow: I installed Python 3 to my local user directory so I could get warcat running on your server
[07:20] it's running unbearably slow on mine, hoping yours is a bit beefier
[07:20] if not, I'll have to get some Amazon instances running
[07:20] or figure out some other plan
[07:21] if thats the case then i will stop
[07:21] It's still been cranking along on the first 25GB megawarc, but only processed 2.5GB of data, so who knows. Maybe chfoo will have some tips when he wakes up
[07:22] heading to bed, maybe it'll finish overnight
[07:25] https://www.bobross.com/gifts.cfm
[07:25] $1,625 for entire bob ross series on DVD
[07:26] i noticed that
[07:27] a part of me thought the complete dvd set was out of print
[07:27] waxpancak: I'm more than happy to install things on fos as needed.
[07:36] By the way, moving forward with upload of East Village Radio.
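
Since a full warcat extract of a 25GB megawarc was taking hours, a cheaper first pass can be useful just to see what's inside before committing. Below is a minimal sketch, standard library only, that walks the WARC record headers in a gzipped megawarc and skips the payloads; the record handling is simplified and the filename is simply the one being discussed above, not a requirement.

```python
# Minimal sketch: list response URLs in a gzipped (mega)warc by reading
# only the WARC record headers and seeking past each payload.
import gzip

def iter_warc_headers(path):
    """Yield a dict of WARC headers for each record, skipping payloads."""
    with gzip.open(path, "rb") as fh:   # handles concatenated gzip members
        while True:
            line = fh.readline()
            if not line:
                break
            if not line.startswith(b"WARC/"):
                continue                # skip blank separators between records
            headers = {}
            for raw in iter(fh.readline, b"\r\n"):   # header block ends at a blank line
                key, _, value = raw.decode("utf-8", "replace").partition(":")
                headers[key.strip()] = value.strip()
            fh.seek(int(headers.get("Content-Length", 0)), 1)   # skip the payload
            yield headers

if __name__ == "__main__":
    for h in iter_warc_headers("upcoming_20130425064232.megawarc.warc.gz"):
        if h.get("WARC-Type") == "response":
            print(h.get("WARC-Target-URI"))
```

Because it never decompresses records into files on disk, this runs far faster than a full extraction and gives a quick inventory of what the megawarc contains.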
[07:37] 10,600 hours
[08:25] http://www.wrestleview.com/wwe-news/48669-unedited-version-of-tonight-s-5-23-wwe-smackdown-leaks-online
[08:54] OK, heading to Stockholm
[08:55] Taking bets on if EVR flips out
[13:30] anyone have a copy of puahate.com in their stash? it might be down for good now
[16:19] just checked, it isn't the webproxy they use: afaik the old ip is http://67.205.13.15/
[16:20] just checking if there is another port they use
[17:03] waxpancak: i'll take a look at it later today. which warc file are you extracting?
[17:05] chfoo: I started with this one: https://archive.org/details/archiveteam_upcoming_20130425064232
[17:06] It's still crunching on it, running for ten hours straight
[17:22] I have a set of videos that I downloaded as part of the Yahoo video archive that I don't think were uploaded. Is it still possible to upload them for archival?
[17:36] Certainly
[17:36] Also, welcome back!
[17:37] Thanks! How do I upload them?
[17:38] if they are already compressed in a file, just upload that file to IA, tag it appropriately, and then leave a message here with a link to the file, and a request to have it moved to the proper collection
[17:39] if you just have a pile of videos and such, you can either compress it into a single file, or ask SketchCow to provide you with a way to transfer them
[17:39] I just have the videos as the download script downloaded them, but I'll compress them and upload them.
[17:43] yeah, probably just a tar file would be best
[17:51] Ok, I'm tarring it now and when it finishes I'll upload and post the link
[21:10] chfoo: After 14 hours, warcat crashed with an OS error. File name too long! http://f.cl.ly/items/0Q2k1h2R0s1W2Z2z3Y3b/andywww1%20warc%20$%20optpython3binp.txt
[21:10] daww :(
[21:10] that's an error from your OS though
[21:10] yeah, not warcat's fault
[21:11] I don't think there's any way to resume. Generating indexes for fast resume is on his todo list
[21:15] daww2
[21:15] Boop.
[21:16] We'll get it right, waxpancak
[21:25] https://archive.org/details/archiveteam_archivebot_go_052
[21:25] There's a party
[21:30] at least some got freegamemanuals.com
[21:30] *someone
[21:34] https://archive.org/search.php?query=collection%3Aeastvillageradio&sort=-publicdate
[21:34] PARTY
[21:38] woooo
[21:39] That's some good work.
[21:39] It's STILL uploading.
[21:39] So, SadDM - don't know if you saw
[21:39] A bunch came back, before they deleted them
[21:39] https://archive.org/~tracey/mrtg/derivesg.html
[21:40] It's surprisingly fast at deriving.
[21:40] yes, I saw that. I think the night we started looking at it, there was wonkiness on their end... I was seeing lots of 4xx errors. I'm glad you went back to check.
[21:43] Now, sadly, they killed the playlists.
[21:43] Good news is I think we have every playlist except this week.
[21:44] We'll see. Tomorrow, when Jake comes back, I'll have his script and we'll use it against Wayback.
[21:44] Right now, I'm just dumping them in because, come on, 550gb of items.
[21:45] shunk shunk shunk shunk
[21:45] I'm also going to .tar up all 550gb and make that a separate item.
[21:45] Doubles the space but I don't want them to affect it and I want to be able to generate torrents.
[21:45] EVR will either flip out, or glorify us.
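
For the "tar it up, upload it to IA, and tag it" advice at 17:38, a sketch using the `internetarchive` Python client could look like the following. The directory name, item identifier, and metadata values here are placeholders, not the actual Yahoo Video item.

```python
# Hedged sketch of the "tar and upload to IA" step: identifiers, paths
# and metadata below are placeholders.
import tarfile
import internetarchive

# bundle the downloaded videos into a single tar file
with tarfile.open("yahoo-video-grab.tar", "w") as tar:
    tar.add("yahoo_videos/")            # directory the download script filled

# upload and tag it so it can be moved into the proper collection later
internetarchive.upload(
    "yahoo-video-grab-placeholder",     # item identifier (placeholder)
    files=["yahoo-video-grab.tar"],
    metadata={
        "title": "Yahoo Video grab (partial)",
        "mediatype": "movies",
        "subject": "archiveteam;yahoovideo",
    },
)
```

After the upload finishes, a note in the channel with the item link is still needed so an admin can move it into the right collection, as described above.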
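
On the "File name too long" OSError at 21:10: the usual culprit is a URL-derived path component longer than the 255-byte per-component limit on common Linux filesystems. The following is a generic workaround sketch, not a warcat feature: clamp each component and append a short hash so clamped names stay unique.

```python
# Generic workaround sketch for over-long URL-derived filenames; not part
# of warcat, just one way an extraction script could sanitise paths.
import hashlib
import os

NAME_MAX = 255  # bytes per path component on ext4 and most Linux filesystems

def safe_component(name, limit=NAME_MAX):
    raw = name.encode("utf-8", "surrogateescape")
    if len(raw) <= limit:
        return name
    digest = hashlib.sha1(raw).hexdigest()[:12]
    keep = raw[: limit - len(digest) - 1].decode("utf-8", "ignore")
    return f"{keep}_{digest}"

def safe_path(path):
    return os.sep.join(safe_component(part) for part in path.split(os.sep))
```

Since warcat had no resume support at the time, sanitising the planned output names up front like this would avoid losing a 14-hour run to a single oversized name.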
[21:46] first one, then the other ten years down the line
[21:46] I do think "Internet Archive put up the entire East Village Radio Archives" could make huge news
[21:57] Jake made it so it could do logos, playlists, and the name of the show
[21:57] Once that blows in, we'll make subcollections for every show.
[21:58] And then, man, basically 5 years of audio (some shows only go back a few months or years, of course)
[21:58] There needs to be some kind of UK TV/radio news archive.
[21:58] rad
[21:58] (sorry apropos of nothing)
[21:58] But we're past an actual year, 24/7, of music.
[21:58] Suddenly thought about this and super-bummed it doesn't exist
[21:58] With djs and all
[21:59] We're nowhere near peak digitization
[21:59] We're not running out of things to digitize and processes are getting better.
[22:01] <3 archive team
[22:13] Does IA TV also record satellite TVs?
[22:13] * Nemo_bis dreams of Italian channels being included
[22:13] i only question satellite tv cause they don't have the blaze in their tv section
[22:15] in other news i'm ripping the wisconsin public radio website: http://www.wpr.org/
[22:16] once i get the search mp3 list i can then grab it
[22:16] there is also metadata for each of the shows
[22:18] antomatic: it does exist: http://bufvc.ac.uk/tvandradio/offair
[22:18] (but they only record London, so they won't get regional BBC/ITV news, for example)
[22:22] New EVR show gets uploaded every 90 seconds.
[22:26] ats_: Ah, thanks! Although "The service is not available to members of the general public."
[22:26] once you get all of the set lists and logos up, it will be a fantastic collection.
[22:27] yes; it's only for universities that subscribe to it, unfortunately
[22:27] What a research tool that'd be if it were publically available, though.
[22:27] that's what really caught my interest with EVR (I couldn't care less about the music)... there was some pretty decent metadata
[22:30] I wonder what the BBC would do if someone just uploaded a month's worth of their TV and radio news (for example) to the Internet Archive..
[22:31] Maybe other broadcasters too. Try to capture as comprehensive a picture of a month of broadcast news as possible.
[22:31] that could be interesting
[22:32] if you wanted broadcast news, you would also want to grab Euronews/Al Jazeera/CNN/RT etc.
[22:32] That's quite doable, they're all up on satellite and unencrypted, so easy to bulk-record.
[22:33] there was a BBC research project about capturing entire DVB multiplexes: https://en.wikipedia.org/wiki/BBC_Redux
[22:33] Only the major outlets have subtitling (closed captioning) that would help in the process of indexing, but that could be added later by hand..
[22:33] SketchCow: Crunched a full 25GB megawarc on fos, about five hours for warcat to turn it into 8GB of denormalized extracted HTML/JS/images, in individual directories organized by hostname. Pretty great.
[22:33] I do love the idea of things like BBC Redux, but while the end results are just kept in house and not publically available, it's just masturbation.
[22:34] if you could find a way to do it legally, you could fund it by selling it as an Ofcom compliance monitoring solution to smaller TV/radio channels...
[22:34] Heh. I did consider that once. :)
[22:34] I can tell you know something about this!
[22:34] no, I just spend too much time reading (a) BBC Research articles and (b) Ofcom broadcast bulletins ;-)
[22:35] Ah yes, some channels do have trouble keeping recordings, don't they. :)
[22:35] hmmm. I wonder.
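
To illustrate the "individual directories organized by hostname" layout mentioned at 22:33: this is not warcat's actual code, just a sketch of how a record's target URI can be mapped to a hostname/path location on disk, with the example URL and the index.html fallback chosen as assumptions.

```python
# Illustration only: map a WARC target URI to a hostname/path on disk,
# the kind of layout the extraction above produces.
import os
from urllib.parse import urlsplit, unquote

def output_path(target_uri, root="extracted"):
    parts = urlsplit(target_uri)
    path = unquote(parts.path)
    if not path or path.endswith("/"):
        path += "index.html"            # assumed default name for directory URLs
    return os.path.join(root, parts.hostname or "unknown-host", path.lstrip("/"))

print(output_path("http://upcoming.yahoo.com/event/123456/"))
# -> extracted/upcoming.yahoo.com/event/123456/index.html
```

Laying files out this way is what makes it possible to drop the extracted tree behind a static web server and have the old URLs resolve again.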
[22:36] Must sketch out what would be good to record (and how, and from where) and how much hardware would be needed..
[22:36] but, for example, this evening flipping between Euronews and BBC News has been interesting -- it's the EU election results, and the BBC weren't allowed to report on speculation whereas Euronews apparently were...
[22:37] The elections are kind of what got me thinking along these lines. It feels like there should be some public database of everything every politician has ever said to a camera or microphone. :)
[22:37] but widening the idea to news overall seemed better still.
[22:38] waxpancak: So it didn't crash? I thought you said it crashed.
[22:38] So, BBC is a VERY special case, you all know that right?
[22:39] special 'hardasses' or special 'public body and can't do a damn thing to stop us' ? :)
[22:39] I was running two concurrently, one on your server and one on mine with two different archives
[22:39] The one on my server died.
[22:39] Got it.
[22:39] FOS is superior.
[22:39] Yeah, it's easily 3x more powerful than my old Softlayer box
[22:40] It gets shit done.
[22:40] It's adding metadata to 118,000 items
[22:40] though I imagine the long filename issue will happen on either, yours crunches through the archives so much faster, it's not even funny
[22:41] I'm starting the second one in a minute
[22:41] We've done a lot to FOS to make things happen.
[22:41] one of the great things about warcat is that it's denormalizing the files, so it ends up much smaller than the grab even when uncompressed
[22:42] Right.
[22:42] er, normalizing
[22:42] You're an important test case, which needed to happen.
[22:42] Yeah, I'll write all this up
[22:43] uploading bmxmuseum.com-inf-20140415-235147-dffny.warc.gz: [############### ] 68270/139703 - 00:56:55
[22:43] That's a party too... bmxmuseum getting the love
[22:43] It's good fodder for a Kickstarter backup update, but I'm happy to add it to the Archive Team wiki or wherever else you like
[22:43] Whatever works. I just like the extraction being a proof case for pulling items back.
[22:44] Since through archivebot, we're slamming literally hundreds of sites and millions of URLs in
[23:35] antomatic: I'm not exactly sure how it would work in your case, but I know MythTV can handle recording everything on the same multiplex simultaneously with one tuner: http://www.mythtv.org/wiki/Record_multiple_channels_from_one_multiplex
[23:37] I've used it because it's occasionally useful- when ClearQAM was more available here, you could sometimes record multiple things at once using one tuner. Unfortunately, it's a bit hard to know what's in the same multiplex without getting your hands dirty