[00:01] I did some poking around on this earlier in the week
[00:06] the streams are all stored in urls of the following form: http:////www.eastvillageradio.com/archivedshows/{p}/{f}.mp3
[00:06] where {f} and {p} are the same as the url parameters to the streaming player
[00:08] dashcloud: fwiw I sussed that out with chrome's developer tools... so there's that
[00:19] also, I have all of the World Wide Mash streams... 28GB
[00:45] SadDM: So do we think we can do this?
[00:45] If someone writes some scripts, I can run them at IA and use the pipe.
[00:46] They seem to limit downloads to about 500KB/s
[00:46] but I did that whole show without them limiting me further
[00:46] It might be doable
[00:47] scraping out the necessary parameters is a bit of a pain but not terrible
[00:47] if a couple of people split up the task of generating the URLs and fed them to you... that might be the best bet
[00:49] Give me an example of an mp3.
[00:49] stand by...
[00:50] http:////www.eastvillageradio.com/archivedshows/1300/1300-18839-20090716.mp3
[00:52] whoops... there's a couple extra slashes in there for some reason
[00:53] there are 58 shows, and the one I did took about 13-14 hours
[00:53] parallelization could speed that up, or they could catch on to you and throttle
[01:02] Yes, but --2014-05-20 01:02:12-- http://www.eastvillageradio.com/archivedshows/1300/1300-18839-20090716.mp3
[01:02] Resolving www.eastvillageradio.com... 98.129.116.171
[01:02] Connecting to www.eastvillageradio.com|98.129.116.171|:80... connected.
[01:02] HTTP request sent, awaiting response... 404 Not Found
[01:02] 2014-05-20 01:02:12 ERROR 404: Not Found.
[01:02] really?
[01:02] I took that right from the list of urls that I downloaded
[01:04] Tried it at two different IPs, different mechanisms.
[01:04] Nothing
[01:04] lemme try again
[01:05] Even stranger... I'm getting a 400: Bad Request
[01:06] yeah, it definitely seems to be gone now
[01:09] 17,390,160 1.16MB/s eta 82s
[01:09] I'm getting 1.16MB/s
[01:09] I just double-checked the streaming player and tried it with a url that I was streaming, but still got a 400
[01:09] you got it to work?
[01:10] yes.
[01:10] I could probably stream a good ripper tonight.
[01:10] Script a good stream ripper
[01:10] sorry
[01:11] I was just stunned that John Lennon sold his house to Ringo Starr
[01:11] :-D
[01:11] Ok, this one is on me.
[01:11] I can do this.
[01:11] The ripper is easy.
[01:11] do you want me to start feeding you lists of magic numbers?
[01:11] I can acquire those myself.
[01:11] No, the problem is I want to come up with a way to turn it into an IA item.
[01:11] alright then
[01:11] I'm THINKING
[01:12] Look at this BRAIN
[01:12] yeah, they have playlists for each show too which would make great metadata for each show stream
[01:14] No other reason to do it.
[01:15] Anything I can do to help?
[01:16] I'm almost begged for the night, but I could start serious work after the day-job ends tomorrow
[01:28] I think someone assisting me with ripping out html tables from the playlists would help.
[01:29] I mean, I can do all this, but I have a massive to-do list
[01:30] OK, I'll start looking into that tomorrow night.
[01:32] OK, so, division of labor
[01:32] I am going to go ahead and just start taking in mp3s.
[01:33] Since it's XXXX-YYYYY-ZZZZZZZZ.mp3 where XXXX is the show id, the resulting pile of ids can be stripped.
[01:33] So we can take this mp3 set, and turn them into described items.
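The URL form and the XXXX-YYYYY-ZZZZZZZZ.mp3 naming described above can be sketched in a few lines of shell. This is only an illustration, not anything that was run in the channel: the function names `mp3_url` and `show_id` are made up here, and the only field the chat identifies is the leading show id, so the sketch does not try to interpret the other two.

```shell
#!/bin/sh
# Hypothetical helpers, not from the chat.

# Build the archive URL from the streaming player's p and f parameters.
mp3_url() {
    printf 'http://www.eastvillageradio.com/archivedshows/%s/%s.mp3\n' "$1" "$2"
}

# Print the leading show id of a name such as 1300-18839-20090716.mp3.
show_id() {
    name=${1##*/}                # drop any directory or URL prefix
    printf '%s\n' "${name%%-*}"  # keep everything before the first dash
}
```

For example, `mp3_url 1300 1300-18839-20090716` prints the example URL pasted at 00:50 (without the stray extra slashes), and `show_id` on that URL prints `1300`.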
[01:42] https://archive.org/details/evr_5744-50944-20140513
[01:42] Experiment #1
[01:46] This actually won't be hard on the suck side
[01:59] PROC=$$
[01:59] wget "$1"
[01:59] mv nowplaying* $PROC.showarchive.txt
[01:59] for each in `cat $PROC.showarchive.txt | grep shows/player/main | sed 's/.*p=//g' | sed 's/\".*//g' | sed 's/^/http:\/\/www.eastvillageradio.com\/archivedshows\//g' | sed 's/\&f=/\//g' | sed 's/$/.mp3/g'`
[01:59] do
[01:59] wget --user-agent="EVR Will Never Die" "$each"
[01:59] done
[01:59] Turns out it wasn't hard at all.
[02:08] Pulling in 6 hours of radio every 4 minutes.
[02:19] Looks like 7 simultaneous streams is about the smartest
[02:33] wow- that was quick!
[03:45] 712 hours grabbed so far.
[03:46] So, one month.
[11:46] For anyone with access to the 4chan article -i can't be bothered to sign in - we only archive images for 7 days (I'm the admin of deniableplausibility)
[12:50] 1,879 hours grabbed.
[12:52] Shows are falling fast!
[12:52] SketchCow: I'm pulling down the playlists as we speak... I'll parse out the tables tonight
[12:52] Actually, sorry, it's actually 3758 hours, 154 days.
[12:52] SadDM: Thanks.
[12:53] It'll be piles of mp3 with names like:
[12:53] 1232-234867-20140429.mp3
[12:53] I also grabbed the show descriptions and art.
[12:53] So, I'm doing an experiment, which is not coming out well.
[12:53] I went to a wayback copy of the site, to find the shows now gone
[12:53] And not surprisingly, their mp3s are wiped.
[12:54] Also, as our wayback archives prove, this whole "playlist here, click here to listen" thing starts up in 2009.
[12:54] ok, when I get the tables parsed out I'll put them in text files something like 1232-234867-20140429.desc? Something like that seems pretty script friendly to me.
[12:55] So while we won't get ALL the shows that were alive on EVR, we do have things going back to the full range of archive the site had.
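The sed chain in the loop pasted at 01:59 turns each player link into an mp3 URL. A condensed sketch of just that step is below, as a function that reads a saved show-archive page. The function name `extract_mp3_urls` is hypothetical, and the sketch assumes, as the pasted script does, that each player link sits on its own line, contains `shows/player/main`, and carries `p=` and `&f=` parameters terminated by a double quote.

```shell
#!/bin/sh
# Hypothetical helper; same substitutions as the 01:59 loop, in one sed call.
extract_mp3_urls() {
    grep 'shows/player/main' "$1" |
        sed -e 's/.*p=//' \
            -e 's/".*//' \
            -e 's|^|http://www.eastvillageradio.com/archivedshows/|' \
            -e 's|&f=|/|' \
            -e 's/$/.mp3/'
}
```

On an xargs that supports `-P`, piping the output into something like `xargs -n 1 -P 7 wget` would be one way to get the seven simultaneous streams mentioned at 02:19; the chat does not say how that was actually done.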
[12:55] SadDM: Definitely do a single one for me to see, and we'll experiment with injecting it into the page.
[12:59] I'll get that to you some time tonight. It looks like mapping the playlist to the show's id is going to take a tiny bit of work... probably more than I can get accomplished on my breaks today.
[13:00] It's not THAT bad.
[13:00] But I agree.
[13:00] yeah, it's just that the xxx-yyyyy-zzzzzzzzz number isn't in the url
[13:00] It is, it's in the "listen"
[13:01] So that page has playlist page and listen link
[13:01] right
[13:01] so I just need to grab that one piece of data to look up another... not too bad
[13:02] it's just not *in* the playlist page's url
[13:03] So, as expected (?) the fact is, of the "shows" I can download from, they are only the shows that are still around, and then going back as far as the shows were streamed under the "new" system (2009)
[13:03] And in some cases, mp3s have been removed regardless, even though it's an active show, so only the last couple of years.
[13:05] But it is VERY obviously going to go past 4000 hours of music
[13:05] it is very hard to complain
[13:14] yup... it's going to be a nice collection of hipster rage.
[16:54] SketchCow: for https://archive.org/details/evr_5744-50944-20140513 how did you extract that table? Did you just do a copy & paste? I'm asking because the tables in the html files seem to be several separate tables juggled into place with javascript.
[17:05] I did it by hand as proof.
[17:10] ugh... I was afraid of that
[17:12] I'm open to suggestions from *anybody* on how to programmatically rip the table from this page: http://www.eastvillageradio.com/shows/playlists.aspx?contentid=1208&showid=511106&list=206717
[17:31] They miss the step where they call us: http://www.smashingmagazine.com/2014/05/19/last-goodbye-shut-down-failing-product/
[17:34] SadDM: I am asking my co-employee at archive.org.
[17:37] He wants it.
[17:37] You shouldn't work on it. He has it.
[17:37] He caused us to pay attention to it, he will eat the pain.
[17:38] LOL... good enough. I've got thousands more gaming zines to concentrate on anyway.
[18:54] SadDM: https://archive.org/details/evr_test_item
[18:56] Your guy did this? What show and date is it?
[19:05] https://archive.org/details/evr_test_item
[19:05] Now has logo. He's working on date.
[19:05] Logo AND description.
[19:20] that's looking pretty good
[19:21] I'd be interested to know how he's (?) re-assembling the set list.
[20:00] He's a python genius
[20:00] I bet he's just doing parsing
[20:01] Like, I bet he's just got an HTML ingestor.
[20:03] I love that the world is filled with people that are smarter, and have more experience than me.
[21:33] helloooooooooooooooo
[21:33] I have a request
[21:34] Oh, btw, it's not a famous website, I think
[21:34] And I'm still browsing it
[21:34] So, it might even turn unvaluable
[21:34] But, I'd ask if it is possible to save smartphrases.com
[21:34] smartphrase.com*
[21:35] That's all.
[21:35] I don't think it's closing down, but I dunno
[21:35] archivebot is on it
[21:38] Oh my go
[21:38] d
[21:38] Do you mean you were already archiving that website? Or that you're going to archive it now?
[21:38] Or something else?
[21:38] started it just now
[21:38] http://archivebot.at.ninjawedding.org:4567/
[21:39] I wonder if that website isn't too big!
[21:39] I very much doubt that :)
[21:39] 33666.57 MB
[21:39] 33 gigs???
[21:39] Oops
[21:40] 4.86 MB? o_O
[21:40] Is it for real?? XD
[21:40] so far monod
[21:40] Oh
[21:41] Guys, couldn't you get some colleges involved in your project? You'd get a lot of bandwidth, e.g.
[21:42] if you know someone with a spare xeon sitting around in a college please send them our way
[21:42] Uhm
[21:42] xeon == server? Or what?
[21:42] intel's server chip
[21:43] How does one cost? Also, what about bandwidth? Isn't it uncorrelated to server chips?
[21:43] How much* does one cost
[21:44] sure, you need bandwidth, CPU, memory, and disk
[21:44] $1100ish for a server? or ~$60/mo on OVH
[21:45] That's another question: who has all that archiving capacity? HDD capacity I mean
[21:45] for archivebot? for all the other projects? whoever here pays for it
[21:46] Online storage???
[21:46] or do you mean who stores everything long-term? that would be archive.org
[21:46] Uhm
[21:47] I kinda meant: where are all the files being downloaded right now? :) And yeah, also who keeps them in the long-term, to which you already answered
[21:47] there's being downloaded to an OVH machine in Canada
[21:48] s/'s/'re/
[21:48] gah need sleep
[21:48] same :D
[21:48] Going to get some in minutes ;)
[21:48] Anyway, then you re-download from the OVH servers to your "home", @archive.org
[21:49] Right?
[21:49] no, they're uploaded to fos.textfiles.com
[21:49] from there they make it into an archivebot collection on archive.org
[21:50] https://archive.org/details/archivebot
[21:50] Thanks
[22:08] Cya all!
[22:12] Where's my hug
[22:13] * exmic points to the door
[22:13] he'll be around shortly
[22:13] * Baljem provides interim SketchCow-hugging services
[22:14] my rates are exceedingly reasonable, too!
[22:16] they exceed reasonability
[22:18] I just love it when someone goes running in with questions.
[22:19] I was disappointed there wasn't more head-explodey action with that one
[22:19] that's always the best bit
[22:19] I like the ones where someone goes "ok got the minimum amount of information OKAY PEOPLE HERE IS MY GROUND UP REWRITE FOR A COMPLETE OVERHAUL OF THE ARCHIVE TEAM PROCESS"
[22:20] There's ossification of procedure and there's not making the same fundamental mistake 4,000 times
[22:21] WHY IS THE TRACKER NOT IN RUBY ON RAILS
[22:21] this thing you've been doing for years? yeah, it sucks. i re-engineered the entire thing while standing in the shower this morning
[22:21] the only ruby on rails that's acceptable is http://rubylovesyou.com/
[22:22] nsfwish
[22:22] I guess this is getting kind of offtopic
[22:23] Or really, really ontopic
[22:23] or that
[22:23] yeah, see, my rates are nowhere near her rates
[22:23] she takes care of all the microsoft boys
[22:23] lol
[22:24] admittedly my services are limited to hugs, though, so y'know.
[22:24] you know they sell whips, right?
[22:27] i really don't want to know what financial domination is, do i
[22:28] And now I am playing the latest episode of Veep at +20% speed
[22:28] Apparently the Internet Archive and Wayback machine are mentioned.
[22:28] oh speaking of which I should make sure my DigitalOcean account has enough money
[22:28] it'd be hilarious if archivebot.at.ninjawedding.org just died
[22:28] yeah, hilarious
[22:28] all good
[22:28] Laff Riot
[22:29] glad that someone is on that