#archiveteam 2014-05-25,Sun


Time Nickname Message
00:00 🔗 waxpancak will do, thanks
00:01 🔗 balrog one other thing: I don't have space for such a large file atm, but if you have a smaller file (couple of gigs max) I can test some stuff
00:02 🔗 waxpancak I'm working on these: https://archive.org/details/archiveteam_upcoming
00:02 🔗 waxpancak they're all pretty big though, 25GB each
00:07 🔗 balrog yeah that's a bit big for now :/
00:08 🔗 waxpancak I can use the `Megawarc` utility to restore the megawarc.warc to a TAR of .warc files, then use warctozip to convert them to ZIPs full of files, and then extract those
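A rough Python sketch of the first two steps of that route, for reference: "restore" is a documented megawarc subcommand, but the exact argument form and file names here are assumptions, and the warctozip step is left out because its CLI isn't shown in the log.

    # Sketch of the pipeline described above (assumptions noted inline).
    # Assumed: "megawarc restore" takes the base name of the megawarc
    # trio (.megawarc.warc.gz / .tar / .json) and writes BASE.tar.
    import subprocess
    import tarfile

    base = "upcoming_20130425064232"  # example base name from this grab

    # step 1: rebuild the original tar of .warc.gz files
    subprocess.run(["megawarc", "restore", base], check=True)

    # step 2: unpack the restored tar into individual .warc.gz files
    with tarfile.open(base + ".tar") as tar:
        tar.extractall("warcs/")

    # step 3 (warctozip -> unzip) omitted: its CLI isn't shown in the log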
00:09 🔗 waxpancak I just wonder if there's something more efficient
00:09 🔗 balrog afaict warcat should be able to extract megawarcs
00:09 🔗 waxpancak seems like something that would've come up, trying to reconstruct an Archive Team save into a static archive
00:09 🔗 balrog I can't really try it from here since I'm currently starved for storage :/
00:10 🔗 waxpancak I'll give it a shot, thanks
00:13 🔗 nico waxpancak: good luck
00:13 🔗 balrog keep us updated :)
00:13 🔗 waxpancak have to switch to a server where I can install python3 to try it
00:13 🔗 waxpancak but that's no big deal, crossing my fingers
00:17 🔗 * nico has built python3 into his ~archivebot directory
00:17 🔗 nico debian's version is too old for yipdw's code :)
00:18 🔗 yipdw I should mandate an OS version
00:18 🔗 yipdw it seems that even with Python's abstractions that stuff still comes back to do weird shit
00:19 🔗 nico let's make a coreos image !
00:20 🔗 nico when archivebot was using wget-lua, i had some strange bug seen only by me because the vps was x86_64
00:20 🔗 danneh_ urgh, javascript is weird
00:21 🔗 nico wpull is a real improvement
00:21 🔗 nico danneh_: javascript is php on the client side
00:22 🔗 balrog nico: sounds like undefined C behaviour
00:23 🔗 balrog which should be fixed :/
00:23 🔗 nico good luck fixing the wget code
00:24 🔗 balrog nico: I've fixed wget code before
00:25 🔗 yipdw less obscure possibility: you run out of memory faster in 64-bit mode
00:25 🔗 yipdw anyway off-topic siren
00:25 🔗 nico without the kernel complaining?
00:30 🔗 danneh_ Alright then, downloading a bunch of these json files: http://h18000.www1.hp.com/cpq-products/quickspecs/soc/soc_80798.json numbered from 80001 to 80798 apparently
00:30 🔗 danneh_ After that I'll scrape those for PDF links and grab those too
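A minimal Python sketch of that plan; the URL pattern and ID range come straight from the log, but the regex-based PDF scrape is an assumption, since the JSON structure isn't shown.

    # Fetch soc_80001.json .. soc_80798.json and collect anything that
    # looks like a PDF link. Stdlib only.
    import re
    import urllib.request

    BASE = "http://h18000.www1.hp.com/cpq-products/quickspecs/soc/soc_{}.json"
    pdf_links = set()

    for n in range(80001, 80799):  # 80001..80798 inclusive
        try:
            with urllib.request.urlopen(BASE.format(n), timeout=30) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except Exception as exc:  # some IDs may be missing
            print("skip", n, exc)
            continue
        pdf_links.update(re.findall(r'https?://[^"\'\s]+\.pdf', body, re.I))

    with open("pdf_links.txt", "w") as f:
        f.write("\n".join(sorted(pdf_links)))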
00:31 🔗 balrog danneh_: good job :D
00:31 🔗 danneh_ balrog: thanks :D
00:31 🔗 danneh_ tip: don't worry about the JS at all, just open up Chrome's Networking panel and see what it requests from there instead
00:33 🔗 nico looks like soc_80000.json is the menu
00:36 🔗 danneh_ hmm, there's also stuff like: http://h18000.www1.hp.com/cpq-products/quickspecs/14907_div/14907_div.json and http://h18000.www1.hp.com/cpq-products/quickspecs/division/10991_div.json scattered around
00:36 🔗 danneh_ Lots of different places for things
00:39 🔗 waxpancak oh man, warcat is THE BEST
00:40 🔗 waxpancak it's tearing through a 25GB megawarc right now. This makes the process *so much simpler*
00:40 🔗 waxpancak ./opt/python3/bin/python3 -m warcat extract upcoming_20130425064232.megawarc.warc.gz --output-dir ./tmp/ --progress
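That invocation can be looped over a whole directory of megawarcs; a small sketch, where the warcat command line is the one above and the directory layout is assumed.

    # Run the warcat extract command above over every downloaded megawarc.
    import glob
    import subprocess

    for warc in sorted(glob.glob("megawarcs/*.megawarc.warc.gz")):
        subprocess.run(
            ["./opt/python3/bin/python3", "-m", "warcat", "extract", warc,
             "--output-dir", "./tmp/", "--progress"],
            check=True,
        )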
00:42 🔗 dashcloud didn't someone previously grab HP's entire FTP and upload it to IA? Couldn't you just break out all of the PDFs from that dump if it exists? (this is assuming the destination for the PDFs is actually on HP's FTP)
00:42 🔗 waxpancak And it's splitting it by hostname, effectively reconstructing the site (and the assets it linked to) as they were before it died. This is exactly what I wanted.
00:43 🔗 balrog ;-)
00:45 🔗 dashcloud waxpancak: would you mind writing up the process from start to finish so it can be placed on the wiki for others who might like to do the same thing?
00:46 🔗 waxpancak Happy to.
00:46 🔗 dashcloud I don't know of anyone else who's actually done something like this with a WARC before
00:46 🔗 balrog I've extracted WARCs using the unarchiver and it produces basically that sort of output.
00:46 🔗 balrog but those were small warcs that I created, usually.
00:47 🔗 balrog or mobileme warcs
00:47 🔗 waxpancak I acquired the upcoming.org domain back from Yahoo and I'm trying to put the historical archives back at their original URLs
00:47 🔗 Baljem dashcloud: from experience with finding ex-DEC stuff on HP's site, I wouldn't want to bet on it being in the ftp.hp.com dump
00:47 🔗 balrog waxpancak: I'm aware. Good work with that, btw! :)
00:48 🔗 balrog I'm assuming those archives will just be static, correct?
00:48 🔗 waxpancak yeah, just a historical archive with a minimalist design
00:49 🔗 waxpancak I can't thank all of you enough, it was pretty amazing to watch the grab when it happened a year ago
00:49 🔗 godane i'm starting to upload season 14 of the joy of painting
01:00 🔗 dashcloud thank you for being a generally awesome person
01:20 🔗 danneh_ Just to keep up, here's what I'm currently archiving, will add to it as I go find more links (and if you guys find more, ping me!): http://pastebin.com/g3v3fDhh
01:20 🔗 DFJustin <balrog> (FWIW http://wakaba.c3.cx/s/apps/unarchiver has WARC support, since I insisted ;) )
01:20 🔗 balrog DFJustin: any comments on that?
01:20 🔗 DFJustin haha sweet, I was gonna bug him at some point
01:20 🔗 balrog ah :P
01:25 🔗 DFJustin hmm it only seems to work with .warc and not .warc.gz though
01:27 🔗 balrog DFJustin: it should un-gzip the .gz first
01:30 🔗 DFJustin well like doing `lsar blahblah.warc.gz` just shows blahblah.warc
01:30 🔗 DFJustin which is inconvenient
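Decompressing first, as balrog suggests, sidesteps that; a stdlib sketch using the filename from the log.

    # Un-gzip the .warc.gz so lsar (or any WARC-aware tool) sees the plain
    # .warc. Python's gzip handles the multi-member files WARC tools write.
    import gzip
    import shutil

    with gzip.open("blahblah.warc.gz", "rb") as fin, \
         open("blahblah.warc", "wb") as fout:
        shutil.copyfileobj(fin, fout)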
02:45 🔗 waxpancak Warcat is saving everything exactly how I need, but this is going to take *forever*. It's been cranking on a single 25GB .megawarc (one of 142 saved) for the last two hours, but it's only extracted 1.3G of uncompressed files so far
02:46 🔗 waxpancak chfoo: Any advice?
06:43 🔗 godane so i'm starting to grab more 60 minutes rewind episodes
06:56 🔗 SketchCow Shhh, no hugging, archive team is supposed to be mean
06:56 🔗 SketchCow Greets from Copenhagen, soon Sweden
07:02 🔗 godane i just got the cbs state of the union webcast from 2011
07:02 🔗 godane it's the 'after show' report of the state of the union
07:02 🔗 godane that was only online i think
07:04 🔗 godane season 16 of the joy of painting is going up
07:06 🔗 SketchCow Godane, there's no way the Joy of Painting will survive.
07:07 🔗 godane ok
07:07 🔗 godane i thought it would
07:08 🔗 godane the guy is sort of dead
07:19 🔗 SketchCow https://www.bobross.com/
07:19 🔗 SketchCow A very active, very profitable, very involved company that still sells the shows.
07:19 🔗 waxpancak SketchCow: I installed Python 3 to my local user directory so I could get warcat running on your server
07:20 🔗 waxpancak it's running unbearably slow on mine, hoping yours is a bit beefier
07:20 🔗 waxpancak if not, I'll have to get some Amazon instances running
07:20 🔗 waxpancak or figure out some other plan
07:21 🔗 godane if that's the case then i will stop
07:21 🔗 waxpancak It's still been cranking along on the first 25GB megawarc, but only processed 2.5GB of data, so who knows. Maybe chfoo will have some tips when he wakes up
07:22 🔗 waxpancak heading to bed, maybe it'll finish overnight
07:25 🔗 SketchCow https://www.bobross.com/gifts.cfm
07:25 🔗 SketchCow $1,625 for entire bob ross series on DVD
07:26 🔗 godane i noticed that
07:27 🔗 godane a part of me thought the complete dvd set was out of print
07:27 🔗 SketchCow waxpancak: I'm more than happy to install things on fos as needed.
07:36 🔗 SketchCow By the way, moving forward with upload of East Village Radio.
07:37 🔗 SketchCow 10,600 hours
08:25 🔗 DFJustin http://www.wrestleview.com/wwe-news/48669-unedited-version-of-tonight-s-5-23-wwe-smackdown-leaks-online
08:54 🔗 SketchCow OK, heading to Stockholm
08:55 🔗 SketchCow Taking bets on if EVR flips out
13:30 🔗 ivan` anyone have a copy of puahate.com in their stash? it might be down for good now
16:19 🔗 midas just checked, it isn't the webproxy they use: afaik the old ip is http://67.205.13.15/
16:20 🔗 midas just checking if there is another port they use
17:03 🔗 chfoo waxpancak: i'll take a look at it later today. which warc file are you extracting?
17:05 🔗 waxpancak chfoo: I started with this one: https://archive.org/details/archiveteam_upcoming_20130425064232
17:06 🔗 waxpancak It's still crunching on it, running for ten hours straight
17:22 🔗 amuck I have a set of videos that I downloaded as part of the Yahoo video archive that I don't think were uploaded. Is it still possible to upload them for archival?
17:36 🔗 dashcloud Certainly
17:36 🔗 dashcloud Also, welcome back!
17:37 🔗 amuck Thanks! How do I upload them?
17:38 🔗 dashcloud if they are already compressed in a file, just upload that file to IA, tag it appropriately, and then leave a message here with a link to the file, and a request to have it moved to the proper collection
17:39 🔗 dashcloud if you just have a pile of videos and such, you can either compress it into a single file, or ask SketchCow to provide you with a way to transfer them
17:39 🔗 amuck I just have the videos as the download script downloaded them, but I'll compress them and upload them.
17:43 🔗 exmic yeah, probably just a tar file would be best
17:51 🔗 amuck Ok, I'm tarring it now and when it finishes I'll upload and post the link
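For anyone scripting this step, the upload can also be done with the internetarchive Python library; a sketch in which the identifier, filename, and metadata are made-up placeholders, and IA credentials are assumed to be configured already (e.g. via "ia configure").

    # Sketch of uploading a tarball to IA with the "internetarchive"
    # library (pip install internetarchive). All names here are
    # hypothetical placeholders, not values from the log.
    from internetarchive import upload

    responses = upload(
        "yahoo-video-grab-example",      # hypothetical item identifier
        files=["yahoo_videos.tar"],
        metadata={
            "title": "Yahoo Video grab (example)",
            "mediatype": "movies",
            "subject": "archiveteam;yahoovideo",
        },
    )
    print([r.status_code for r in responses])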
21:10 🔗 waxpancak chfoo: After 14 hours, warcat crashed with an OS error. File name too long! http://f.cl.ly/items/0Q2k1h2R0s1W2Z2z3Y3b/andywww1%20warc%20$%20optpython3binp.txt
21:10 🔗 schbirid daww :(
21:10 🔗 schbirid that's an error from your OS though
21:10 🔗 waxpancak yeah, not warcat's fault
21:11 🔗 waxpancak I don't think there's any way to resume. Generating indexes for fast resume is on his todo list
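One generic workaround for that crash, not a warcat feature: clamp each path component to the filesystem's per-name limit and append a short hash so shortened names stay unique. A sketch:

    # Hypothetical post-processing helper for "File name too long".
    import hashlib
    import os

    MAX_COMPONENT = 255  # typical per-component byte limit on ext4 etc.

    def shorten(component, limit=MAX_COMPONENT):
        raw = component.encode("utf-8")
        if len(raw) <= limit:
            return component
        digest = hashlib.sha1(raw).hexdigest()[:8]
        # keep the head of the name, mark it with the hash
        return raw[:limit - 9].decode("utf-8", "ignore") + "-" + digest

    def safe_path(path):
        return os.sep.join(shorten(part) for part in path.split(os.sep))

    # e.g. call safe_path() on each target filename before writing it out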
21:15 🔗 schbirid daww2
21:15 🔗 SketchCow Boop.
21:16 🔗 SketchCow We'll get it right, waxpancak
21:25 🔗 SketchCow https://archive.org/details/archiveteam_archivebot_go_052
21:25 🔗 SketchCow There's a party
21:30 🔗 godane at least someone got freegamemanuals.com
21:34 🔗 SketchCow https://archive.org/search.php?query=collection%3Aeastvillageradio&sort=-publicdate
21:34 🔗 SketchCow PARTY
21:38 🔗 exmic woooo
21:39 🔗 SadDM That's some good work.
21:39 🔗 SketchCow It's STILL uploading.
21:39 🔗 SketchCow So, SadDM - don't know if you saw
21:39 🔗 SketchCow A bunch came back, before they deleted them
21:39 🔗 Nemo_bis https://archive.org/~tracey/mrtg/derivesg.html
21:40 🔗 Nemo_bis It's surprisingly fast at deriving.
21:40 🔗 SadDM yes, I saw that. I think the night we started looking at it, there was wonkiness on their end... I was seeing lots of 4xx errors. I'm glad you went back to check.
21:43 🔗 SketchCow Now, sadly, they killed the playlists.
21:43 🔗 SketchCow Good news is I think we have every playlist except this week.
21:44 🔗 SketchCow We'll see. Tomorrow, when Jake comes back, I'll have his script and we'll use it against Wayback.
21:44 🔗 SketchCow Right now, I'm just dumping them in because, come on, 550gb of items.
21:45 🔗 exmic shunk shunk shunk shunk
21:45 🔗 SketchCow I'm also going to .tar up all 550gb and make that a separate item.
21:45 🔗 SketchCow Doubles the space but I don't want them to affect it and I want to be able to generate torrents.
21:45 🔗 SketchCow EVR will either flip out, or glorify us.
21:46 🔗 exmic first one, then the other ten years down the line
21:46 🔗 SketchCow I do think "Internet Archive put up the entire East Village Radio Archives" could make huge news
21:57 🔗 SketchCow Jake made it so it could do logos, playlists, and the name of the show
21:57 🔗 SketchCow Once that blows in, we'll make subcollections for every show.
21:58 🔗 SketchCow And then, man, basically 5 years of audio (some shows only go back a few months or years, of course)
21:58 🔗 antomatic There needs to be some kind of UK TV/radio news archive.
21:58 🔗 exmic rad
21:58 🔗 antomatic (sorry apropos of nothing)
21:58 🔗 SketchCow But we're past an actual year, 24/7, of music.
21:58 🔗 antomatic Suddenly thought about this and super-bummed it doesn't exist
21:58 🔗 SketchCow With djs and all
21:59 🔗 SketchCow We're nowhere near peak digitization
21:59 🔗 SketchCow We're not running out of things to digitize and processes are getting better.
22:01 🔗 amerrykan <3 archive team
22:13 🔗 Nemo_bis Does IA TV also record satellite TV?
22:13 🔗 * Nemo_bis dreams of Italian channels being included
22:13 🔗 godane i only question satellite tv because they don't have The Blaze in their TV section
22:15 🔗 godane in other news i'm ripping the wisconsin public radio website: http://www.wpr.org/
22:16 🔗 godane once i get the search mp3 list i can then grab it
22:16 🔗 godane there is also metadata for each of the shows
22:18 🔗 ats_ antomatic: it does exist: http://bufvc.ac.uk/tvandradio/offair
22:18 🔗 ats_ (but they only record London, so they won't get regional BBC/ITV news, for example)
22:22 🔗 SketchCow New EVR show gets uploaded every 90 seconds.
22:26 🔗 antomatic ats_: Ah, thanks! Although "The service is not available to members of the general public."
22:26 🔗 SadDM once you get all of the set lists and logos up, it will be a fantastic collection.
22:27 🔗 ats_ yes; it's only for universities that subscribe to it, unfortunately
22:27 🔗 antomatic What a research tool that'd be if it were publically available, though.
22:27 🔗 SadDM that's what really caught my interest with EVR (I couldn't care less about the music)... there was some pretty decent metadata
22:30 🔗 antomatic I wonder what the BBC would do if someone just uploaded a month's worth of their TV and radio news (for example) to the Internet Archive..
22:31 🔗 antomatic Maybe other broadcasters too. Try to capture as comprehensive a picture of a month of broadcast news as possible.
22:31 🔗 exmic that could be interesting
22:32 🔗 ats_ if you wanted broadcast news, you would also want to grab Euronews/Al Jazeera/CNN/RT etc.
22:32 🔗 antomatic That's quite doable, they're all up on satellite and unencrypted, so easy to bulk-record.
22:33 🔗 ats_ there was a BBC research project about capturing entire DVB multiplexes: https://en.wikipedia.org/wiki/BBC_Redux
22:33 🔗 antomatic Only the major outlets have subtitling (closed captioning) that would help in the process of indexing, but that could be added later by hand..
22:33 🔗 waxpancak SketchCow: Crunched a full 25GB megawarc on fos, about five hours for warcat to turn it into 8GB of normalized extracted HTML/JS/images, in individual directories organized by hostname. Pretty great.
22:33 🔗 antomatic I do love the idea of things like BBC Redux, but while the end results are just kept in house and not publically available, it's just masturbation.
22:34 🔗 ats_ if you could find a way to do it legally, you could fund it by selling it as an Ofcom compliance monitoring solution to smaller TV/radio channels...
22:34 🔗 antomatic Heh. I did consider that once. :)
22:34 🔗 antomatic I can tell you know something about this!
22:34 🔗 ats_ no, I just spend too much time reading (a) BBC Research articles and (b) Ofcom broadcast bulletins ;-)
22:35 🔗 antomatic Ah yes, some channels do have trouble keeping recordings, don't they. :)
22:35 🔗 antomatic hmmm. I wonder.
22:36 🔗 antomatic Must sketch out what would be good to record (and how, and from where) and how much hardware would be needed..
22:36 🔗 ats_ but, for example, this evening flipping between Euronews and BBC News has been interesting -- it's the EU election results, and the BBC weren't allowed to report on speculation whereas Euronews apparently were...
22:37 🔗 antomatic The elections are kind of what got me thinking along these lines. It feels like there should be some public database of everything every politician has ever said to a camera or microphone. :)
22:37 🔗 antomatic but widening the idea to news overall seemed better still.
22:38 🔗 SketchCow waxpancak: So it didn't crash? I thought you said it crashed.
22:38 🔗 SketchCow So, BBC is a VERY special case, you all know that, right?
22:39 🔗 antomatic special 'hardasses' or special 'public body and can't do a damn thing to stop us'? :)
22:39 🔗 waxpancak I was running two concurrently, one on your server and one on mine with two different archives
22:39 🔗 waxpancak The one on my server died.
22:39 🔗 SketchCow Got it.
22:39 🔗 SketchCow FOS is superior.
22:39 🔗 waxpancak Yeah, it's easily 3x more powerful than my old Softlayer box
22:40 🔗 SketchCow It gets shit done.
22:40 🔗 SketchCow It's adding metadata to 118,000 items
22:40 🔗 waxpancak though I imagine the long filename issue will happen on either; yours crunches through the archives so much faster, it's not even funny
22:41 🔗 waxpancak I'm starting the second one in a minute
22:41 🔗 SketchCow We've done a lot to FOS to make things happen.
22:41 🔗 waxpancak one of the great things about warcat is that it's normalizing the files, so it ends up much smaller than the grab even when uncompressed
22:42 🔗 SketchCow Right.
22:42 🔗 SketchCow You're an important test case, which needed to happen.
22:42 🔗 waxpancak Yeah, I'll write all this up
22:43 🔗 SketchCow uploading bmxmuseum.com-inf-20140415-235147-dffny.warc.gz: [############### ] 68270/139703 - 00:56:55
22:43 🔗 SketchCow That's a party too... bmxmuseum getting the love
22:43 🔗 waxpancak It's good fodder for a Kickstarter backup update, but I'm happy to add it to the Archive Team wiki or wherever else you like
22:43 🔗 SketchCow Whatever works. I just like the extraction being a proof case for pulling items back.
22:44 🔗 SketchCow Since through archivebot, we're slamming literally hundreds of sites and millions of URLs in
23:35 🔗 dashcloud antomatic: I'm not exactly sure how it would work in your case, but I know MythTV can handle recording everything on the same multiplex simultaneously with one tuner: http://www.mythtv.org/wiki/Record_multiple_channels_from_one_multiplex
23:37 🔗 dashcloud I've used it because it's occasionally useful: when ClearQAM was more available here, you could sometimes record multiple things at once using one tuner. Unfortunately, it's a bit hard to know what's in the same multiplex without getting your hands dirty
