#archiveteam 2014-05-25,Sun


Time Nickname Message
00:00 🔗 waxpancak will do, thanks
00:01 🔗 balrog one other thing: I don't have space for such a large file atm, but if you have a smaller file (couple of gigs max) I can test some stuff
00:02 🔗 waxpancak I'm working on these: https://archive.org/details/archiveteam_upcoming
00:02 🔗 waxpancak they're all pretty big though, 25GB each
00:07 🔗 balrog yeah that's a bit big for now :/
00:08 🔗 waxpancak I can use the `Megawarc` utility to restore the megawarc.warc to a TAR of .warc files, then use warctozip to convert them to ZIPs full of files, and then extract those
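A rough Python sketch of the first two steps of that route, for reference: "restore" is a documented megawarc subcommand, but the exact argument form and file names here are assumptions, and the warctozip step is left out because its CLI isn't shown in the log.

    # Sketch of the pipeline described above (assumptions noted inline).
    # Assumed: "megawarc restore" takes the base name of the megawarc
    # trio (.megawarc.warc.gz / .tar / .json) and writes BASE.tar.
    import subprocess
    import tarfile

    base = "upcoming_20130425064232"  # example base name from this grab

    # step 1: rebuild the original tar of .warc.gz files
    subprocess.run(["megawarc", "restore", base], check=True)

    # step 2: unpack the restored tar into individual .warc.gz files
    with tarfile.open(base + ".tar") as tar:
        tar.extractall("warcs/")

    # step 3 (warctozip -> unzip) omitted: its CLI isn't shown in the log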
00:09 🔗 waxpancak I just wonder if there's something more efficient
00:09 🔗 balrog afaict warcat should be able to extract megawarcs
00:09 🔗 waxpancak seems like something that would've come up, trying to reconstruct an Archive Team save into a static archive
00:09 🔗 balrog I can't really try it from here since I'm currently starved for storage :/
00:10 🔗 waxpancak I'll give it a shot, thanks
00:13 🔗 nico waxpancak: good luck
00:13 🔗 balrog keep us updated :)
00:13 🔗 waxpancak have to switch to a server where I can install python3 to try it
00:13 🔗 waxpancak but that's no big deal, crossing my fingers
00:17 🔗 * nico has built python3 into his ~archivebot directory
00:17 🔗 nico debian's version is too old for yipdw's code :)
00:18 🔗 yipdw I should mandate an OS version
00:18 🔗 yipdw it seems that even with Python's abstractions that stuff still comes back to do weird shit
00:19 🔗 nico let's make a coreos image !
00:20 🔗 nico when archivebot was using wget-lua, i had some strange bug seen only by me because the vps was x86_64
00:20 🔗 danneh_ urgh, javascript is weird
00:21 🔗 nico wpull is a real improvement
00:21 🔗 nico danneh_: javascript is php on the client side
00:22 🔗 balrog nico: sounds like undefined C behaviour
00:23 🔗 balrog which should be fixed :/
00:23 🔗 nico good luck fixing the wget code
00:24 🔗 balrog nico: I've fixed wget code before
00:25 🔗 yipdw less obscure possibility: you run out of memory faster in 64-bit mode
00:25 🔗 yipdw anyway off-topic siren
00:25 🔗 nico without the kernel complaining?
00:30 🔗 danneh_ Alright then, downloading a bunch of these json files: http://h18000.www1.hp.com/cpq-products/quickspecs/soc/soc_80798.json numbered from 80001 to 80798 apparently
00:30 🔗 danneh_ After that I'll scrape those for PDF links and grab those too
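A minimal Python sketch of that plan; the URL pattern and ID range come straight from the log, but the regex-based PDF scrape is an assumption, since the JSON structure isn't shown.

    # Fetch soc_80001.json .. soc_80798.json and collect anything that
    # looks like a PDF link. Stdlib only.
    import re
    import urllib.request

    BASE = "http://h18000.www1.hp.com/cpq-products/quickspecs/soc/soc_{}.json"
    pdf_links = set()

    for n in range(80001, 80799):  # 80001..80798 inclusive
        try:
            with urllib.request.urlopen(BASE.format(n), timeout=30) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except Exception as exc:  # some IDs may be missing
            print("skip", n, exc)
            continue
        pdf_links.update(re.findall(r'https?://[^"\'\s]+\.pdf', body, re.I))

    with open("pdf_links.txt", "w") as f:
        f.write("\n".join(sorted(pdf_links)))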
00:31 🔗 balrog danneh_: good job :D
00:31 🔗 danneh_ balrog: thanks :D
00:31 🔗 danneh_ tip: don't worry about the JS at all, just open up Chrome's Networking panel and see what it requests from there instead
00:33 🔗 nico looks like soc_80000.json is the menu
00:36 🔗 danneh_ hmm, there's also stuff like: http://h18000.www1.hp.com/cpq-products/quickspecs/14907_div/14907_div.json and http://h18000.www1.hp.com/cpq-products/quickspecs/division/10991_div.json scattered around
00:36 🔗 danneh_ Lots of different places for things
00:39 🔗 waxpancak oh man, warcat is THE BEST
00:40 🔗 waxpancak it's tearing through a 25GB megawarc right now. This makes the process *so much simpler*
00:40 🔗 waxpancak ./opt/python3/bin/python3 -m warcat extract upcoming_20130425064232.megawarc.warc.gz --output-dir ./tmp/ --progress
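That invocation can be looped over a whole directory of megawarcs; a small sketch, where the warcat command line is the one above and the directory layout is assumed.

    # Run the warcat extract command above over every downloaded megawarc.
    import glob
    import subprocess

    for warc in sorted(glob.glob("megawarcs/*.megawarc.warc.gz")):
        subprocess.run(
            ["./opt/python3/bin/python3", "-m", "warcat", "extract", warc,
             "--output-dir", "./tmp/", "--progress"],
            check=True,
        )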
00:42 🔗 dashcloud didn't someone previously grab HP's entire FTP and upload it to IA? Couldn't you just break out all of the PDFs from that dump if it exists? (this is assuming the destination for the PDFs is actually on HP's FTP)
00:42 🔗 waxpancak And it's splitting it by hostname, effectively reconstructing the site (and the assets it linked to) as they were before it died. This is exactly what I wanted.
00:43 🔗 balrog ;-)
00:45 🔗 dashcloud waxpancak: would you mind writing up the process from start to finish so it can be placed on the wiki for others who might like to do the same thing?
00:46 🔗 waxpancak Happy to.
00:46 🔗 dashcloud I don't know of anyone else who's actually done something like this with a WARC before
00:46 🔗 balrog I've extracted WARCs using the unarchiver and it produces basically that sort of output.
00:46 🔗 balrog but those were small warcs that I created, usually.
00:47 🔗 balrog or mobileme warcs
00:47 🔗 waxpancak I acquired the upcoming.org domain back from Yahoo and I'm trying to put the historical archives back at their original URLs
00:47 🔗 Baljem dashcloud: from experience with finding ex-DEC stuff on HP's site, I wouldn't want to bet on it being in the ftp.hp.com dump
00:47 🔗 balrog waxpancak: I'm aware. Good work with that, btw! :)
00:48 🔗 balrog I'm assuming those archives will just be static, correct?
00:48 🔗 waxpancak yeah, just a historical archive with a minimalist design
00:49 🔗 waxpancak I can't thank all of you enough, it was pretty amazing to watch the grab when it happened a year ago
00:49 🔗 godane i'm starting to upload season 14 of the joy of painting
01:00 🔗 dashcloud thank you for being a generally awesome person
01:20 🔗 danneh_ Just to keep up, here's what I'm currently archiving, will add to it as I go find more links (and if you guys find more, ping me!): http://pastebin.com/g3v3fDhh
01:20 🔗 DFJustin <balrog> (FWIW http://wakaba.c3.cx/s/apps/unarchiver has WARC support, since I insisted ;) )
01:20 🔗 balrog DFJustin: any comments on that?
01:20 🔗 DFJustin haha sweet, I was gonna bug him at some point
01:20 🔗 balrog ah :P
01:25 🔗 DFJustin hmm it only seems to work with .warc and not .warc.gz though
01:27 🔗 balrog DFJustin: it should un-gzip the .gz first
01:30 🔗 DFJustin well like doing `lsar blahblah.warc.gz` just shows blahblah.warc
01:30 🔗 DFJustin which is inconvenient
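Decompressing first, as balrog suggests, sidesteps that; a stdlib sketch using the filename from the log.

    # Un-gzip the .warc.gz so lsar (or any WARC-aware tool) sees the plain
    # .warc. Python's gzip handles the multi-member files WARC tools write.
    import gzip
    import shutil

    with gzip.open("blahblah.warc.gz", "rb") as fin, \
         open("blahblah.warc", "wb") as fout:
        shutil.copyfileobj(fin, fout)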
02:45 🔗 waxpancak Warcat is saving everything exactly how I need, but this is going to take *forever*. It's been cranking on a single 25GB .megawarc (one of 142 saved) for the last two hours, but it's only extracted 1.3G of uncompressed files so far
02:46 🔗 waxpancak chfoo: Any advice?
06:43 🔗 godane so i'm starting to grab more 60 minutes rewind episodes
06:56 🔗 SketchCow Shhh, no hugging, archive team is supposed to be mean
06:56 🔗 SketchCow Greets from Copenhagen, soon Sweden
07:02 🔗 godane i just got the cbs state of the union webcast from 2011
07:02 🔗 godane it's the 'after show' report of the state of the union
07:02 🔗 godane that was only online i think
07:04 🔗 godane season 16 of the joy of painting is going up
07:06 🔗 SketchCow Godane, there's no way the Joy of Painting will survive.
07:07 🔗 godane ok
07:07 🔗 godane i thought it would
07:08 🔗 godane the guy is sort of dead
07:19 🔗 SketchCow https://www.bobross.com/
07:19 🔗 SketchCow A very active, very profitable, very involved company that still sells the shows.
07:19 🔗 waxpancak SketchCow: I installed Python 3 to my local user directory so I could get warcat running on your server
07:20 🔗 waxpancak it's running unbearably slow on mine, hoping yours is a bit beefier
07:20 🔗 waxpancak if not, I'll have to get some Amazon instances running
07:20 🔗 waxpancak or figure out some other plan
07:21 🔗 godane if that's the case then i will stop
07:21 🔗 waxpancak It's still been cranking along on the first 25GB megawarc, but only processed 2.5GB of data, so who knows. Maybe chfoo will have some tips when he wakes up
07:22 🔗 waxpancak heading to bed, maybe it'll finish overnight
07:25 🔗 SketchCow https://www.bobross.com/gifts.cfm
07:25 🔗 SketchCow $1,625 for entire bob ross series on DVD
07:26 🔗 godane i noticed that
07:27 🔗 godane a part of me thought the complete dvd set was out of print
07:27 🔗 SketchCow waxpancak: I'm more than happy to install things on fos as needed.
07:36 🔗 SketchCow By the way, moving forward with upload of East Village Radio.
07:37 🔗 SketchCow 10,600 hours
08:25 🔗 DFJustin http://www.wrestleview.com/wwe-news/48669-unedited-version-of-tonight-s-5-23-wwe-smackdown-leaks-online
08:54 🔗 SketchCow OK, heading to Stockholm
08:55 🔗 SketchCow Taking bets on if EVR flips out
13:30 🔗 ivan` anyone have a copy of puahate.com in their stash? it might be down for good now
16:19 🔗 midas just checked, it isn't the webproxy they use: afaik the old ip is http://67.205.13.15/
16:20 🔗 midas just checking if there is another port they use
17:03 🔗 chfoo waxpancak: i'll take a look at it later today. which warc file are you extracting?
17:05 🔗 waxpancak chfoo: I started with this one: https://archive.org/details/archiveteam_upcoming_20130425064232
17:06 🔗 waxpancak It's still crunching on it, running for ten hours straight
17:22 🔗 amuck I have a set of videos that I downloaded as part of the Yahoo video archive that I don't think were uploaded. Is it still possible to upload them for archival?
17:36 🔗 dashcloud Certainly
17:36 🔗 dashcloud Also, welcome back!
17:37 🔗 amuck Thanks! How do I upload them?
17:38 🔗 dashcloud if they are already compressed in a file, just upload that file to IA, tag it appropriately, and then leave a message here with a link to the file, and a request to have it moved to the proper collection
17:39 🔗 dashcloud if you just have a pile of videos and such, you can either compress it into a single file, or ask SketchCow to provide you with a way to transfer them
17:39 🔗 amuck I just have the videos as the download script downloaded them, but I'll compress them and upload them.
17:43 🔗 exmic yeah, probably just a tar file would be best
17:51 🔗 amuck Ok, I'm tarring it now and when it finishes I'll upload and post the link
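For anyone scripting this step, the upload can also be done with the internetarchive Python library; a sketch in which the identifier, filename, and metadata are made-up placeholders, and IA credentials are assumed to be configured already (e.g. via "ia configure").

    # Sketch of uploading a tarball to IA with the "internetarchive"
    # library (pip install internetarchive). All names here are
    # hypothetical placeholders, not values from the log.
    from internetarchive import upload

    responses = upload(
        "yahoo-video-grab-example",      # hypothetical item identifier
        files=["yahoo_videos.tar"],
        metadata={
            "title": "Yahoo Video grab (example)",
            "mediatype": "movies",
            "subject": "archiveteam;yahoovideo",
        },
    )
    print([r.status_code for r in responses])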
21:10 🔗 waxpancak chfoo: After 14 hours, warcat crashed with an OS error. File name too long! http://f.cl.ly/items/0Q2k1h2R0s1W2Z2z3Y3b/andywww1%20warc%20$%20optpython3binp.txt
21:10 🔗 schbirid daww :(
21:10 🔗 schbirid that's an error from your OS though
21:10 🔗 waxpancak yeah, not warcat's fault
21:11 🔗 waxpancak I don't think there's any way to resume. Generating indexes for fast resume is on his todo list
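One generic workaround for that crash, not a warcat feature: clamp each path component to the filesystem's per-name limit and append a short hash so shortened names stay unique. A sketch:

    # Hypothetical post-processing helper for "File name too long".
    import hashlib
    import os

    MAX_COMPONENT = 255  # typical per-component byte limit on ext4 etc.

    def shorten(component, limit=MAX_COMPONENT):
        raw = component.encode("utf-8")
        if len(raw) <= limit:
            return component
        digest = hashlib.sha1(raw).hexdigest()[:8]
        # keep the head of the name, mark it with the hash
        return raw[:limit - 9].decode("utf-8", "ignore") + "-" + digest

    def safe_path(path):
        return os.sep.join(shorten(part) for part in path.split(os.sep))

    # e.g. call safe_path() on each target filename before writing it out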
21:15 🔗 schbirid daww2
21:15 🔗 SketchCow Boop.
21:16 🔗 SketchCow We'll get it right, waxpancak
21:25 🔗 SketchCow https://archive.org/details/archiveteam_archivebot_go_052
21:25 🔗 SketchCow There's a party
21:30 🔗 godane at least someone got freegamemanuals.com
21:34 🔗 SketchCow https://archive.org/search.php?query=collection%3Aeastvillageradio&sort=-publicdate
21:34 🔗 SketchCow PARTY
21:38 🔗 exmic woooo
21:39 🔗 SadDM That's some good work.
21:39 🔗 SketchCow It's STILL uploading.
21:39 🔗 SketchCow So, SadDM - don't know if you saw
21:39 🔗 SketchCow A bunch came back, before they deleted them
21:39 🔗 Nemo_bis https://archive.org/~tracey/mrtg/derivesg.html
21:40 🔗 Nemo_bis It's surprisingly fast at deriving.
21:40 🔗 SadDM yes, I saw that. I think the night we started looking at it, there was wonkiness on their end... I was seeing lots of 4xx errors. I'm glad you went back to check.
21:43 🔗 SketchCow Now, sadly, they killed the playlists.
21:43 🔗 SketchCow Good news is I think we have every playlist except this week.
21:44 🔗 SketchCow We'll see. Tomorrow, when Jake comes back, I'll have his script and we'll use it against Wayback.
21:44 🔗 SketchCow Right now, I'm just dumping them in because, come on, 550gb of items.
21:45 🔗 exmic shunk shunk shunk shunk
21:45 🔗 SketchCow I'm also going to .tar up all 550gb and make that a separate item.
21:45 🔗 SketchCow Doubles the space but I don't want them to affect it and I want to be able to generate torrents.
21:45 🔗 SketchCow EVR will either flip out, or glorify us.
21:46 🔗 exmic first one, then the other ten years down the line
21:46 🔗 SketchCow I do think "Internet Archive put up the entire East Village Radio Archives" could make huge news
21:57 🔗 SketchCow Jake made it so it could do logos, playlists, and the name of the show
21:57 🔗 SketchCow Once that blows in, we'll make subcollections for every show.
21:58 🔗 SketchCow And then, man, basically 5 years of audio (some shows only go back a few months or years, of course)
21:58 🔗 antomatic There needs to be some kind of UK TV/radio news archive.
21:58 🔗 exmic rad
21:58 🔗 antomatic (sorry apropos of nothing)
21:58 🔗 SketchCow But we're past an actual year, 24/7, of music.
21:58 🔗 antomatic Suddenly thought about this and super-bummed it doesn't exist
21:58 🔗 SketchCow With djs and all
21:59 🔗 SketchCow We're nowhere near peak digitization
21:59 🔗 SketchCow We're not running out of things to digitize and processes are getting better.
22:01 🔗 amerrykan <3 archive team
22:13 🔗 Nemo_bis Does IA TV also record satellite TV?
22:13 🔗 * Nemo_bis dreams of Italian channels being included
22:13 🔗 godane i only question satellite tv because they don't have The Blaze in their TV section
22:15 🔗 godane in other news i'm ripping the wisconsin public radio website: http://www.wpr.org/
22:16 🔗 godane once i get the search mp3 list i can then grab it
22:16 🔗 godane there is also metadata for each of the shows
22:18 🔗 ats_ antomatic: it does exist: http://bufvc.ac.uk/tvandradio/offair
22:18 🔗 ats_ (but they only record London, so they won't get regional BBC/ITV news, for example)
22:22 🔗 SketchCow New EVR show gets uploaded every 90 seconds.
22:26 🔗 antomatic ats_: Ah, thanks! Although "The service is not available to members of the general public."
22:26 🔗 SadDM once you get all of the set lists and logos up, it will be a fantastic collection.
22:27 🔗 ats_ yes; it's only for universities that subscribe to it, unfortunately
22:27 🔗 antomatic What a research tool that'd be if it were publically available, though.
22:27 🔗 SadDM that's what really caught my interest with EVR (I couldn't care less about the music)... there was some pretty decent metadata
22:30 🔗 antomatic I wonder what the BBC would do if someone just uploaded a month's worth of their TV and radio news (for example) to the Internet Archive..
22:31 🔗 antomatic Maybe other broadcasters too. Try to capture as comprehensive a picture of a month of broadcast news as possible.
22:31 🔗 exmic that could be interesting
22:32 🔗 ats_ if you wanted broadcast news, you would also want to grab Euronews/Al Jazeera/CNN/RT etc.
22:32 🔗 antomatic That's quite doable, they're all up on satellite and unencrypted, so easy to bulk-record.
22:33 🔗 ats_ there was a BBC research project about capturing entire DVB multiplexes: https://en.wikipedia.org/wiki/BBC_Redux
22:33 🔗 antomatic Only the major outlets have subtitling (closed captioning) that would help in the process of indexing, but that could be added later by hand..
22:33 🔗 waxpancak SketchCow: Crunched a full 25GB megawarc on fos, about five hours for warcat to turn it into 8GB of normalized extracted HTML/JS/images, in individual directories organized by hostname. Pretty great.
22:33 🔗 antomatic I do love the idea of things like BBC Redux, but while the end results are just kept in house and not publically available, it's just masturbation.
22:34 🔗 ats_ if you could find a way to do it legally, you could fund it by selling it as an Ofcom compliance monitoring solution to smaller TV/radio channels...
22:34 🔗 antomatic Heh. I did consider that once. :)
22:34 🔗 antomatic I can tell you know something about this!
22:34 🔗 ats_ no, I just spend too much time reading (a) BBC Research articles and (b) Ofcom broadcast bulletins ;-)
22:35 🔗 antomatic Ah yes, some channels do have trouble keeping recordings, don't they. :)
22:35 🔗 antomatic hmmm. I wonder.
22:36 🔗 antomatic Must sketch out what would be good to record (and how, and from where) and how much hardware would be needed..
22:36 🔗 ats_ but, for example, this evening flipping between Euronews and BBC News has been interesting -- it's the EU election results, and the BBC weren't allowed to report on speculation whereas Euronews apparently were...
22:37 🔗 antomatic The elections are kind of what got me thinking along these lines. It feels like there should be some public database of everything every politician has ever said to a camera or microphone. :)
22:37 🔗 antomatic but widening the idea to news overall seemed better still.
22:38 🔗 SketchCow waxpancak: So it didn't crash? I thought you said it crashed.
22:38 🔗 SketchCow So, BBC is a VERY special case, you all know that, right?
22:39 🔗 antomatic special 'hardasses' or special 'public body and can't do a damn thing to stop us'? :)
22:39 🔗 waxpancak I was running two concurrently, one on your server and one on mine with two different archives
22:39 🔗 waxpancak The one on my server died.
22:39 🔗 SketchCow Got it.
22:39 🔗 SketchCow FOS is superior.
22:39 🔗 waxpancak Yeah, it's easily 3x more powerful than my old Softlayer box
22:40 🔗 SketchCow It gets shit done.
22:40 🔗 SketchCow It's adding metadata to 118,000 items
22:40 🔗 waxpancak though I imagine the long filename issue will happen on either; yours crunches through the archives so much faster, it's not even funny
22:41 🔗 waxpancak I'm starting the second one in a minute
22:41 🔗 SketchCow We've done a lot to FOS to make things happen.
22:41 🔗 waxpancak one of the great things about warcat is that it's normalizing the files, so it ends up much smaller than the grab even when uncompressed
22:42 🔗 SketchCow Right.
22:42 🔗 SketchCow You're an important test case, which needed to happen.
22:42 🔗 waxpancak Yeah, I'll write all this up
22:43 🔗 SketchCow uploading bmxmuseum.com-inf-20140415-235147-dffny.warc.gz: [############### ] 68270/139703 - 00:56:55
22:43 🔗 SketchCow That's a party too... bmxmuseum getting the love
22:43 🔗 waxpancak It's good fodder for a Kickstarter backup update, but I'm happy to add it to the Archive Team wiki or wherever else you like
22:43 🔗 SketchCow Whatever works. I just like the extraction being a proof case for pulling items back.
22:44 🔗 SketchCow Since through archivebot, we're slamming literally hundreds of sites and millions of URLs in
23:35 🔗 dashcloud antomatic: I'm not exactly sure how it would work in your case, but I know MythTV can handle recording everything on the same multiplex simultaneously with one tuner: http://www.mythtv.org/wiki/Record_multiple_channels_from_one_multiplex
23:37 🔗 dashcloud I've used it because it's occasionally useful: when ClearQAM was more available here, you could sometimes record multiple things at once using one tuner. Unfortunately, it's a bit hard to know what's in the same multiplex without getting your hands dirty
