#archiveteam 2014-05-20,Tue

Time Nickname Message
00:01 🔗 SadDM I did some poking around on this earlier in the week
00:06 🔗 SadDM the streams are all stored in urls of the following form: http:////www.eastvillageradio.com/archivedshows/{p}/{f}.mp3
00:06 🔗 SadDM where {f} and {p} are the same as the url parameters to the streaming player
00:08 🔗 SadDM dashcloud: fwiw I sussed that out with chrome's developer tools... so there's that
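A minimal sketch of the URL pattern SadDM describes, assuming the `p` and `f` values are copied verbatim from the streaming player's query string. The helper name `evr_mp3_url` is made up for illustration and is not something used in the channel.

```shell
#!/bin/sh
# Build an archived-show mp3 URL from the streaming player's parameters.
# evr_mp3_url is a hypothetical helper name.
evr_mp3_url() {
    p=$1   # value of the player's "p" parameter
    f=$2   # value of the player's "f" parameter
    printf 'http://www.eastvillageradio.com/archivedshows/%s/%s.mp3\n' "$p" "$f"
}
```

For instance, `evr_mp3_url 1300 1300-18839-20090716` would print the same address that is fetched later in this log.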
00:19 🔗 SadDM also, I have all of the World Wide Mash streams... 28GB
00:45 🔗 SketchCow SadDM: So do we think we can do this?
00:45 🔗 SketchCow If someone writes some scripts, I can run them at IA and use the pipe.
00:46 🔗 SadDM They seem to limit downloads to about 500KB/s
00:46 🔗 SadDM but I did that whole show without them limiting me further
00:46 🔗 SadDM It might be doable
00:47 🔗 SadDM scraping out the necessary parameters is a bit of a pain but not terrible
00:47 🔗 SadDM if a couple of people split up the task of generating the URLs and fed them to you... that might be the best bet
00:49 🔗 SketchCow Give me an example of an mp3.
00:49 🔗 SadDM stand by...
00:50 🔗 SadDM http:////www.eastvillageradio.com/archivedshows/1300/1300-18839-20090716.mp3
00:52 🔗 SadDM whoops... there's a couple extra slashes in there for some reason
00:53 🔗 SadDM there are 58 shows, and the one I did took about 13-14 hours
00:53 🔗 SadDM parallelization could speed that up, or they could catch on to you and throttle
01:02 🔗 SketchCow Yes, but --2014-05-20 01:02:12-- http://www.eastvillageradio.com/archivedshows/1300/1300-18839-20090716.mp3
01:02 🔗 SketchCow Resolving www.eastvillageradio.com... 98.129.116.171
01:02 🔗 SketchCow Connecting to www.eastvillageradio.com|98.129.116.171|:80... connected.
01:02 🔗 SketchCow HTTP request sent, awaiting response... 404 Not Found
01:02 🔗 SketchCow 2014-05-20 01:02:12 ERROR 404: Not Found.
01:02 🔗 SadDM really?
01:02 🔗 SadDM I took that right from the list of urls that I downloaded
01:04 🔗 SketchCow Tried it at two different IPs, different mechanisms.
01:04 🔗 SketchCow Nothing
01:04 🔗 SadDM lemme try again
01:05 🔗 SadDM Even stranger... I'm getting a 400: Bad Request
01:06 🔗 SadDM yeah, it definitely seems to be gone now
01:09 🔗 SketchCow 17,390,160 1.16MB/s eta 82s
01:09 🔗 SketchCow I'm getting 1.16MB/s
01:09 🔗 SadDM I just double-checked the streaming player and tried it with a url that I was streaming, but still got a 400
01:09 🔗 SadDM you got it to work?
01:10 🔗 SketchCow yes.
01:10 🔗 SketchCow I could probably stream a good ripper tonight.
01:10 🔗 SketchCow Script a good stream ripper
01:10 🔗 SketchCow sorry
01:11 🔗 SketchCow I was just stunned that John Lennon sold his house to Ringo Starr
01:11 🔗 SadDM :-D
01:11 🔗 SketchCow Ok, this one is on me.
01:11 🔗 SketchCow I can do this.
01:11 🔗 SketchCow The ripper is easy.
01:11 🔗 SadDM do you want me to start feeding you lists of magic numbers?
01:11 🔗 SketchCow I can acquire those myself.
01:11 🔗 SketchCow No, the problem is I want to come up with a way to turn it into an IA item.
01:11 🔗 SadDM alright then
01:11 🔗 SketchCow I'm THINKING
01:12 🔗 SketchCow Look at this BRAIN
01:12 🔗 SadDM yeah, they have playlists for each show too which would make great metadata for each show stream
01:14 🔗 SketchCow No other reason to do it.
01:15 🔗 SadDM Anything I can do to help?
01:16 🔗 SadDM I'm almost begged for the night, but I could start serious work after the day-job ends tomorrow
01:28 🔗 SketchCow I think someone assisting me with ripping out html tables from the playlists would help.
01:29 🔗 SketchCow I mean, I can do all this, but I have a massive to-do list
01:30 🔗 SadDM OK, I'll start looking into that tomorrow night.
01:32 🔗 SketchCow OK, so, division of labor
01:32 🔗 SketchCow I am going to go ahead and just start taking in mp3s.
01:33 🔗 SketchCow Since it's XXXX-YYYYY-ZZZZZZZZ.mp3 where XXXX is the show id, the resulting pile of ids can be stripped.
01:33 🔗 SketchCow So we can take this mp3 set, and turn them into described items.
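The naming scheme above could be taken apart with plain shell parameter expansion, sketched here under some assumptions. Only the first field (the show id) is confirmed in the channel; the meaning of the middle field is not stated, and reading the last field as a YYYYMMDD date is a guess based on the example names. The function names are hypothetical, and the `evr_` item prefix simply mirrors the one experimental item linked in this log.

```shell
#!/bin/sh
# Split a name like 1232-234867-20140429.mp3 into its three dash-separated fields.
# Prints: show_id middle_field last_field
evr_split_name() {
    base=${1##*/}        # drop any leading directory
    base=${base%.mp3}    # drop the extension
    show_id=${base%%-*}  # text before the first dash (the show id)
    rest=${base#*-}      # everything after the first dash
    middle=${rest%%-*}   # middle field, meaning not stated in the channel
    stamp=${rest#*-}     # last field, apparently a date (assumption)
    printf '%s %s %s\n' "$show_id" "$middle" "$stamp"
}

# Derive an item identifier in the style of the experimental item,
# i.e. the mp3 base name with an "evr_" prefix.
evr_item_id() {
    base=${1##*/}
    printf 'evr_%s\n' "${base%.mp3}"
}
```

Keeping the split in one place means the show id can be stripped off a whole directory of downloads with a simple loop over `*.mp3`.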
01:42 🔗 SketchCow https://archive.org/details/evr_5744-50944-20140513
01:42 🔗 SketchCow Experiment #1
01:46 🔗 SketchCow This actually won't be hard on the suck side
01:59 🔗 SketchCow PROC=$$
01:59 🔗 SketchCow wget "$1"
01:59 🔗 SketchCow mv nowplaying* $PROC.showarchive.txt
01:59 🔗 SketchCow for each in `cat $PROC.showarchive.txt | grep shows/player/main | sed 's/.*p=//g' | sed 's/\".*//g' | sed 's/^/http:\/\/www.eastvillageradio.com\/archivedshows\//g' | sed 's/\&f=/\//g' | sed 's/$/.mp3/g'`
01:59 🔗 SketchCow do
01:59 🔗 SketchCow wget --user-agent="EVR Will Never Die" "$each"
01:59 🔗 SketchCow done
01:59 🔗 SketchCow Turns out it wasn't hard at all.
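The pasted loop derives mp3 URLs from the show archive page with a chain of `grep` and `sed` calls. A rough equivalent using a single `sed` expression might look like the sketch below. The function name `evr_extract_urls` is hypothetical, and the exact shape of the player links in the page is an assumption inferred from the pipeline (a double-quoted attribute containing `shows/player/main`, then `p=...&f=...`). Unlike the original chain, this version silently skips matching lines that have no `&f=` part instead of emitting a malformed URL.

```shell
#!/bin/sh
# Read show-archive HTML on stdin, print one mp3 URL per player link.
# Assumes each link sits on its own line and the attribute is double-quoted.
evr_extract_urls() {
    sed -n '/shows\/player\/main/ s|.*p=\([^"&]*\)&f=\([^"]*\)".*|http://www.eastvillageradio.com/archivedshows/\1/\2.mp3|p'
}
```

It could be used as `evr_extract_urls < "$PROC.showarchive.txt"`, with the output fed to `wget` one line at a time.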
02:08 🔗 SketchCow Pulling in 6 hours of radio every 4 minutes.
02:19 🔗 SketchCow Looks like 7 simultaneous streams is about the smartest
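One way to run several downloads at once is `xargs -P`, sketched below. This is not necessarily how it was done here; the log does not say. The function name `evr_fetch_parallel` and the `JOBS` variable are hypothetical, with the default of 7 taken from the figure mentioned above. `-P` is not in POSIX but is available in GNU and BSD `xargs`.

```shell
#!/bin/sh
# Fetch every URL listed (one per line) in a file, several at a time.
# Usage: evr_fetch_parallel urls.txt [command ...]
evr_fetch_parallel() {
    list=$1
    shift
    # Any remaining arguments are the command to run per URL; default to wget
    # with the user agent used in the channel.
    if [ "$#" -eq 0 ]; then
        set -- wget --user-agent="EVR Will Never Die"
    fi
    # Run up to $JOBS copies of the command at once, one URL per invocation.
    xargs -n 1 -P "${JOBS:-7}" "$@" < "$list"
}
```

Passing `echo` as the command (`evr_fetch_parallel urls.txt echo`) gives a dry run that only prints the URLs, and `JOBS=3 evr_fetch_parallel urls.txt` lowers the concurrency if the server starts throttling.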
02:33 🔗 dashcloud wow- that was quick!
03:45 🔗 SketchCow 712 hours grabbed so far.
03:46 🔗 SketchCow So, one month.
11:46 🔗 damongant For anyone with access to the 4chan article -i can't be bothered to sign in - we only archive images for 7 days (I'm the admin of deniableplausibility)
12:50 🔗 SketchCow 1,879 hours grabbed.
12:52 🔗 SketchCow Shows are falling fast!
12:52 🔗 SadDM SketchCow: I'm pulling down the playlists as we speak... I'll parse out the tables tonight
12:52 🔗 SketchCow Actually, sorry, it's actually 3758 hours, 154 days.
12:52 🔗 SketchCow SadDM: Thanks.
12:53 🔗 SketchCow It'll be piles of mp3 with names like:
12:53 🔗 SketchCow 1232-234867-20140429.mp3
12:53 🔗 SadDM I also grabbed the show descriptions and art.
12:53 🔗 SketchCow So, I'm doing an experiment, which is not coming out well.
12:53 🔗 SketchCow I went to a wayback copy of the site, to find the shows now gone
12:53 🔗 SketchCow And not surprisingly, their mp3s are wiped.
12:54 🔗 SketchCow Also, as our wayback archives prove, this whole "playlist here, click here to listen" thing starts up in 2009.
12:54 🔗 SadDM ok, when I get the tables parsed out I'll put them in text files something like 1232-234867-20140429.desc? Something like that seems pretty script friendly to me.
12:55 🔗 SketchCow So while we won't get ALL the shows that were alive on EVR, we do have things going back to the full range of archive the site had.
12:55 🔗 SketchCow SadDM: Definitely do a single one for me to see, and we'll experiment with injecting it into the page.
12:59 🔗 SadDM I'll get that to you some time tonight. It looks like mapping the playlist to the show's id is going to take a tiny bit of work... probably more than I can get accomplished on my breaks today.
13:00 🔗 SketchCow It's not THAT bad.
13:00 🔗 SketchCow But I agree.
13:00 🔗 SadDM yeah, it's just that the xxx-yyyyy-zzzzzzzzz number isn't in the url
13:00 🔗 SketchCow It is, it's in the "listen"
13:01 🔗 SketchCow So that page has playlist page and listen link
13:01 🔗 SadDM right
13:01 🔗 SadDM so I just need to grab that one piece of data to look up another... not too bad
13:02 🔗 SadDM it's just not *in* the playlist page's url
13:03 🔗 SketchCow So, as expected (?) the fact is, of the "shows" I can download from, they are only the shows that are still around, and then going back as far as the shows were streamed under the "new" system (2009)
13:03 🔗 SketchCow And in some cases, mp3s have been removed regardless, even though it's an active show, so only the last couple of years.
13:05 🔗 SketchCow But it is VERY obviously going to go past 4000 hours of music
13:05 🔗 SketchCow it is very hard to complain
13:14 🔗 SadDM yup... it's going to be a nice collection of hipster rage.
16:54 🔗 SadDM SketchCow: for https://archive.org/details/evr_5744-50944-20140513 how did you extract that table? Did you just do a copy & paste? I'm asking because the tables in the html files seem to be several separate tables juggled into place with javascript.
17:05 🔗 SketchCow I did it by hand as proof.
17:10 🔗 SadDM ugh... I was afraid of that
17:12 🔗 SadDM I'm open to suggestions from *anybody* on how to programmatically rip the table from this page: http://www.eastvillageradio.com/shows/playlists.aspx?contentid=1208&showid=511106&list=206717
17:31 🔗 SadDM They miss the step where they call us: http://www.smashingmagazine.com/2014/05/19/last-goodbye-shut-down-failing-product/
17:34 🔗 SketchCow SadDM: I am asking my co-employee at archive.org.
17:37 🔗 SketchCow He wants it.
17:37 🔗 SketchCow You shouldn't work on it. He has it.
17:37 🔗 SketchCow He caused us to pay attention to it, he will eat the pain.
17:38 🔗 SadDM LOL... good enough. I've got thousands more gaming zines to concentrate on anyway.
18:54 🔗 SketchCow SadDM: https://archive.org/details/evr_test_item
18:56 🔗 SadDM Your guy did this? What show and date is it?
19:05 🔗 SketchCow https://archive.org/details/evr_test_item
19:05 🔗 SketchCow Now has logo. He's working on date.
19:05 🔗 SketchCow Logo AND description.
19:20 🔗 SadDM that's looking pretty good
19:21 🔗 SadDM I'd be interested to know how he's (?) re-assembling the set list.
20:00 🔗 SketchCow He's a python genius
20:00 🔗 SketchCow I bet he's just doing parsing
20:01 🔗 SketchCow Like, I bet he's just got an HTML ingestor.
20:03 🔗 SadDM I love that the world is filled with people that are smarter, and have more experience than me.
21:33 🔗 monod helloooooooooooooooo
21:33 🔗 monod I have a request
21:34 🔗 monod Oh, btw, it's not a famous website, I think
21:34 🔗 monod And I'm still browsing it
21:34 🔗 monod So, it might even turn unvaluable
21:34 🔗 monod But, I'd ask if it is possible to save smartphrases.com
21:34 🔗 monod smartphrase.com*
21:35 🔗 monod That's all.
21:35 🔗 monod I don't think it's closing down, but I dunno
21:35 🔗 ivan` archivebot is on it
21:38 🔗 monod Oh my go
21:38 🔗 monod d
21:38 🔗 monod Do you mean you were already archiving that website? Or that you're going to archive it now?
21:38 🔗 monod Or something else?
21:38 🔗 ivan` started it just now
21:38 🔗 ivan` http://archivebot.at.ninjawedding.org:4567/
21:39 🔗 monod I wonder if that website isn't too big!
21:39 🔗 ivan` I very much doubt that :)
21:39 🔗 monod 33666.57 MB
21:39 🔗 monod 33 gigs???
21:39 🔗 monod Oops
21:40 🔗 monod 4.86 MB? o_O
21:40 🔗 monod Is it for real?? XD
21:40 🔗 Smiley so far monod
21:40 🔗 monod Oh
21:41 🔗 monod Guys, couldn't you get some colleges involved in your project? You'd get a lot of bandwidth, e.g.
21:42 🔗 ivan` if you know someone with a spare xeon sitting around in a college please send them our way
21:42 🔗 monod Uhm
21:42 🔗 monod xeon == server? Or what?
21:42 🔗 ivan` intel's server chip
21:43 🔗 monod How does one cost? Also, what about bandwidth? Isn't it uncorrelated to server chips?
21:43 🔗 monod How much* does one cost
21:44 🔗 ivan` sure, you need bandwidth, CPU, memory, and disk
21:44 🔗 ivan` $1100ish for a server? or ~$60/mo on OVH
21:45 🔗 monod That's another question: who has all that archiving capacity? HDD capacity I mean
21:45 🔗 ivan` for archivebot? for all the other projects? whoever here pays for it
21:46 🔗 monod Online storage???
21:46 🔗 ivan` or do you mean who stores everything long-term? that would be archive.org
21:46 🔗 monod Uhm
21:47 🔗 monod I kinda meant: where are all the files being downloaded right now? :) And yeah, also who keeps them in the long-term, to which you already answered
21:47 🔗 ivan` there's being downloaded to an OVH machine in Canada
21:48 🔗 ivan` s/'s/'re/
21:48 🔗 ivan` gah need sleep
21:48 🔗 monod same :D
21:48 🔗 monod Going to get some in minutes ;)
21:48 🔗 monod Anyway, then you re-download from the OVH servers to your "home", @archive.org
21:49 🔗 monod Right?
21:49 🔗 ivan` no, they're uploaded to fos.textfiles.com
21:49 🔗 ivan` from there they make it into an archivebot collection on archive.org
21:50 🔗 ivan` https://archive.org/details/archivebot
21:50 🔗 monod Thanks
22:08 🔗 monod Cya all!
22:12 🔗 SketchCow Where's my hug
22:13 🔗 * exmic points to the door
22:13 🔗 exmic he'll be around shortly
22:13 🔗 * Baljem provides interim SketchCow-hugging services
22:14 🔗 Baljem my rates are exceedingly reasonable, too!
22:16 🔗 exmic they exceed reasonability
22:18 🔗 SketchCow I just love it when someone goes running in with questions.
22:19 🔗 Baljem I was disappointed there wasn't more head-explodey action with that one
22:19 🔗 Baljem that's always the best bit
22:19 🔗 SketchCow I like the ones where someone goes "ok got the minimum amount of information OKAY PEOPLE HERE IS MY GROUND UP REWRITE FOR A COMPLETE OVERHAUL OF THE ARCHIVE TEAM PROCESS"
22:20 🔗 SketchCow There's ossification of procedure and there's not making the same fundamental mistake 4,000 times
22:21 🔗 SketchCow WHY IS THE TRACKER NOT IN RUBY ON RAILS
22:21 🔗 amerrykan this thing you've been doing for years? yeah, it sucks. i re-engineered the entire thing while standing in the shower this morning
22:21 🔗 exmic the only ruby on rails that's acceptable is http://rubylovesyou.com/
22:22 🔗 exmic nsfwish
22:22 🔗 exmic I guess this is getting kind of offtopic
22:23 🔗 SketchCow Or really, really ontopic
22:23 🔗 exmic or that
22:23 🔗 Baljem yeah, see, my rates are nowhere near her rates
22:23 🔗 amerrykan she takes care of all the microsoft boys
22:23 🔗 exmic lol
22:24 🔗 Baljem admittedly my services are limited to hugs, though, so y'know.
22:24 🔗 exmic you know they sell whips, right?
22:27 🔗 amerrykan i really don't want to know what financial domination is, do i
22:28 🔗 SketchCow And now I am playing the latest episode of Veep at +20% speed
22:28 🔗 SketchCow Apparently the Internet Archive and Wayback machine are mentioned.
22:28 🔗 yipdw oh speaking of which I should make sure my DigitalOcean account has enough money
22:28 🔗 yipdw it'd be hilarious if archivebot.at.ninjawedding.org just died
22:28 🔗 exmic yeah, hilarious
22:28 🔗 yipdw all good
22:28 🔗 SketchCow Laff Riot
22:29 🔗 exmic glad that someone is on that
