[01:22] I feel like putting huge sites into archivebot, jamming it up for one-off archivals, is making it useless for small emergency grabs
[01:22] does anyone else agree?
[01:44] there needs to be an express lane or something for it
[01:44] sure I guess
[01:45] otherwise it will always expand to fill capacity and starve new jobs
[01:45] right
[01:45] is what is happening right this moment
[01:45] I have half a mind to start killing huge ancient jobs
[01:45] but I won't
[01:47] I would say page the people who started some of the stuff and ask if it's still getting anything of value or likely to finish
[01:48] I think there are some fire-and-forget cases in there
[01:48] but there are things like maben or preterhuman where there is a well-defined and valuable set of stuff that just takes a while to get through
[01:53] stuff like asiair or mturk it's not clear to me what it's getting or whether it will ever finish
[01:54] pixiv I guarantee will not finish because there are 40 million plus images with multiple urls for each
[01:55] * exmic nods
[01:55] and I haven't actually seen it retrieve a .jpg yet
[03:04] http://www.motherjones.com/media/2014/05/internet-archive-wayback-machine-brewster-kahle
[03:05] missed that
[05:32] I can't spend any cash but I'd like to help with the justin.tv archiving
[05:36] join #justouttv
[05:41] ah, the thread I'm reading says that was still being set up
[05:45] it seems dead-ish.
[07:13] go curl, go!
[07:13] curly curl
[07:14] so exmic, in those result pages, grab the span with class "small black" and you have the user IDs
[07:14] so you do
[07:15] looking in the page src for a numeric user id, not finding what I want
[07:15] plug the user ID into http://api.justin.tv/api/channel/archives/officecam.json?limit=50 (use limit and offset parameters)
[07:15] then ???
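The recipe described above (pull the "small black" spans out of the saved search-result pages, then feed each username to the archives endpoint with limit/offset paging) can be sketched in shell. The API is long gone and the HTML sample here is invented for illustration, so the URLs are only printed, never fetched; the page size of 50 comes from the log.

```shell
# Made-up fragment of a saved justin.tv search-result page:
html='<span class="small black">officecam</span><span class="small black">lirik</span>'

# Extract the text inside each class="small black" span:
ids=$(printf '%s\n' "$html" | grep -o '<span class="small black">[^<]*' | sed 's/.*>//')
printf '%s\n' "$ids"

# Build the paginated archive-listing URLs (limit/offset, per the log)
# without actually hitting the (long-dead) API:
urls=$(for id in $ids; do
  for offset in 0 50 100; do
    echo "http://api.justin.tv/api/channel/archives/$id.json?limit=50&offset=$offset"
  done
done)
printf '%s\n' "$urls"
```

In a real run the offset loop would continue until a page came back empty rather than stopping at a fixed count.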
[07:15] then we win
[07:15] no, you won't get numeric IDs
[07:15] what dinguses
[07:15] ooh, an api
[07:15] I like apis
[07:15] in the JSON I linked, there's userID parameters
[07:16] and direct links to flvs
[07:16] allegedly
[07:16] this API is particularly fucked.
[07:16] how so?
[07:16] this is like taking candy from a baby
[07:16] oh, okay then.
[07:16] I mean, what's fucked about it from your perspective?
[07:16] just having a mental block when you said you needed numeric IDs
[07:16] oh
[07:16] do you want the listing HTML I grabbed?
[07:17] I like numeric ids because you can use a shell 'for' loop to iterate them :P
[07:17] question, do you know how to convert to all lowercase in shell?
[07:17] | tr 'A-Z' 'a-z'
[07:17] oh ffs, that program does everything
[07:17] unix!
[07:18] nope, uniq -u fucked me over. I'm trying to get rid of the duplicates in that pastebin I linked
[07:18] got it, no -u
[07:19] this one http://pastebin.com/cK1P3dhw ?
[07:19] yeah
[07:19] ah, I see, duplicates.
[07:19] I need to go from that listing to a list of curlable URLs. Hm.
[07:21] aye
[07:21] I'm full of coffee but the coffeeshop is closing
[07:22] :P
[07:22] going home, back on in 30-50 minutes
[07:22] okay, I'm available for
[07:22] 3ish hours
[07:23] probly sooner but leaving myself time
[07:27] while read l end; do for n in $(seq 1 $end); do echo "curl 'http://www.justin.tv/search?q=$l&only=archives&sort-by=count&only=users&page=$n' -o $l$n.html"; done; done
[07:27] you'll need to feed it a file containing "a 1000" "b 1500" etc lines
[07:27] will do, back soon
[07:40] hey closure, the $1 for the filename doesn't work
[07:41] should be a $l not $1
[07:41] ie, L
[07:41] while read L END; do for N in $(seq 1 $END); do echo "curl 'http://www.justin.tv/search?q=$l&only=archives&sort-by=count&only=users&page=$n' -o $L$N.html"; done; done
[07:41] yes sorry, it is, still don't work.
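The lowercase trick and the uniq -u stumble above are worth pinning down: `uniq -u` prints only lines that are *not* repeated, so any name that occurs twice vanishes entirely, while plain `uniq` after `sort` is what actually deduplicates. A minimal sketch with made-up names:

```shell
# Made-up username list with mixed case and duplicates:
printf 'Alice\nBOB\nalice\nbob\nalice\n' > names.txt

# Lowercase, sort, and collapse duplicates to one copy each:
tr 'A-Z' 'a-z' < names.txt | sort | uniq > names_unique.txt
cat names_unique.txt

# The trap from the log: -u keeps only lines with NO duplicates,
# so here it prints nothing at all (every name repeats):
tr 'A-Z' 'a-z' < names.txt | sort | uniq -u | wc -l
```

`sort -u` would also do the dedup in one step; `uniq` needs sorted input either way, since it only compares adjacent lines.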
hold on
[07:42] you might need to use bash
[07:42] I am using bash
[07:42] while read L END; do for N in $(seq 1 $END); do echo "curl 'http://www.justin.tv/search?q=$L&only=archives&sort-by=count&only=users&page=$N' -o $L$N.html"; done; done
[07:43] works for me, eg if I echo a 10 | to it
[07:43] I've reformatted for that script at http://pastebin.com/cK1P3dhw
[07:44] I only get 1.html 2.html etc, which isn't unique
[07:50] I'm assuming you're working on this. I'm going to grab some food
[07:53] * exmic slides in
[07:55] so i have over 226k uniq urls for nbcnews.com i think
[07:56] i work on like 5 projects at a time
[08:12] erm
[08:12] the list I generated is too big for pastebin
[08:14] heh, makes sense
[08:15] this is dumb
[08:15] does someone have a place I can dump 2.5mb of text or a smaller gzip?
[08:16] mail it to me and I can put it on the net somewhere
[08:16] duncan@xrtc.net
[08:17] The owner of this one-man company just died. Any chance of someone setting the archive bot on it? http://www.hollowsun.com
[08:17] sure! :D
[08:18] Thankyou! ^.^
[08:18] sad to hear the person died though
[08:19] sent, exmic
[08:19] Mhmm. :/
[08:19] rxcvd
[08:20] It looks like this is his too: http://www.novachord.co.uk
[08:20] stuffed that in the queue too
[08:21] Thank you!
[08:21] my pleasure
[08:21] You're the best. :)
[08:21] OK, time for breakfast, see you.
[08:22] cheerio
[08:26] voltagex: it should be at http://bl-r.com/trx/justin-tv-search-urls.txt.gz
[08:29] exmic: thanks. Not sure if it's useful for people.
[08:29] * exmic shrugs
[08:29] c'est la vie
[08:29] I really need someone with a faster connection to pull it down
[08:29] a few thousand text pages shouldn't take that long on any reasonable connection
[08:30] how many bits/sec do you have?
[08:30] there's reasonable, then there's Australian
[08:33] I mean, yeah
[08:34] wallaby stole your guys internet
[08:36] lol
[08:36] downloading it with a canadian VPS now.
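The fix above comes down to shell variables being case-sensitive: the loop reads into L and END, so $l, $n, and the typo'd $1 all expand to nothing, which is why every file came out as 1.html, 2.html, and so on. A working sketch of the corrected loop, printing the curl commands rather than running them, with an invented two-line terms file:

```shell
# Made-up input: "search-term page-count" pairs, one per line.
cat > terms.txt <<'EOF'
a 3
b 2
EOF

# Same loop as in the log, variable names matching case throughout.
# Capture the generated commands instead of executing them:
cmds=$(while read -r L END; do
  for N in $(seq 1 "$END"); do
    echo "curl 'http://www.justin.tv/search?q=$L&only=archives&sort-by=count&only=users&page=$N' -o $L$N.html"
  done
done < terms.txt)
printf '%s\n' "$cmds"
```

Piping the output to `sh` (or `parallel`) would then actually run the fetches.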
[08:36] >> I think justin.tv returns 200 when it means 500
[08:36] ffs
[08:37] fucking web coders
[08:37] hey hey hey I am a web coder :P
[08:43] 4 hours -_-
[08:43] that's not too bad
[08:43] :]
[08:44] yes, but it is past the time I will be asleep and I won't be able to hand the results off yet
[08:44] mmm
[08:44] you should go to sleep and clean it up quick in the morning then
[08:44] could just give you a key to this VPS :P
[08:44] I'm about to hit the sack myself
[08:44] being near 0200
[08:44] ah, righto
[08:44] I'll see how far it gets in the next two hours
[08:45] heh, thanks for reminding me to fix the clock drift on this box :P
[08:56] so i found an abc special called unbroke
[08:57] all i know is its from 2010-02-24 and will smith is in it
[08:57] anyways to #-bs
[08:58] 310mb of HTML downloaded, 100ish failures
[09:26] so that abc special is really from may 29 2009
[09:27] the best link about the special sadly: http://muppet.wikia.com/wiki/Un-Broke:_What_You_Need_to_Know_About_Money
[09:43] some stuff I'm writing to keep track of what I'm doing
[09:43] https://gist.github.com/voltagex/6067ee19df87dac7072c
[11:01] hey exmic you there?
[11:01] also closure
[11:35] yesh
[11:36] closure: download of all search results pages successful :D
[11:36] closure: except it's on the world's worst VPS and I can barely tar it up
[11:38] and here I thought I had the world's worst VPS, after struggling all night with its horrible horrible grub
[11:38] otoh, it's all working now
[11:40] wait, you have to deal with bootloaders?
[11:41] well, when I'm setting up a server with half a terabyte of storage, I want to make sure I can get a grub menu if it breaks later
[11:44] closure: if I gave you 2GB of Justin.TV search results HTML, would it be useful?
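If a server really does answer 200 for failures, as suspected above, the status code can't be trusted and the response body has to be validated instead. A hedged sketch, with a canned response standing in for what curl would have fetched (the error payload's shape is invented; the real justin.tv error format isn't in the log):

```shell
# Stand-in for a body that came back with HTTP 200 but is actually an error:
response='{"error":"internal server error"}'

# Decide from the body, not the status code, whether to keep or retry:
if printf '%s' "$response" | grep -q '"error"'; then
  verdict="retry"
else
  verdict="keep"
fi
echo "$verdict"
```

In a crawl script the "retry" branch would re-queue the URL with backoff rather than archiving the bogus body.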
[11:56] choo choo
[12:15] so far i have about 1TB of video downloaded
[12:20] sorry midas, didn't mean to tread on your toes
[12:20] I'm working on a list of all usernames
[12:20] >> I'm hoping this wasn't pointless
[12:22] no no! please, continue
[12:23] because im only grabbing the videos right now
[12:24] 21813 usernames_unique.txt
[12:24] if that's correct, 21.8k users with archived videos
[12:36] FUCL
[12:36] FUCK*
[12:36] empty json
[12:43] who wanted numeric user IDs?
[12:48] anyone awake? I need a hand
[12:51] ping midas closure exmic godane
[12:55] on irc, ask your question / raise your point
[12:55] dont wait for someone to say "ok, now i am listening"
[12:55] irc is asynchronous and awesome at that
[12:55] schbirid: I know, I just don't want the work I've done to go to waste
[12:56] oh i thought "who wanted numeric user IDs" was a rant at json :D
[12:56] I'm looking for someone to coordinate more video downloads. I've got to a point where I have a list of channels to grab (with video URLs inside JSON) but I think I've been blacklisted by justin.tv
[12:56] I let wget run a bit too fast :D
[12:56] there is a dedicated channel now
[12:56] schbirid: https://gist.github.com/voltagex/6067ee19df87dac7072c is where I'm up to
[12:56] #justouttv
[12:56] yes, I'm there too
[12:57] but the people helping me earlier were mainly in this channel. Anywho. I have to go very soon
[12:57] nice :)
[13:32] "When you're using the Justintv API like you are, you can get a higher rate limit by sending oauth authenticated requests instead of just normal ones" https://groups.google.com/forum/#!msg/twitch-api/8YHDdNran_A/4cv3l8Adf58J
[13:36] bleh!
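Getting blacklisted after letting wget "run a bit too fast", as above, is usually avoidable by pacing requests; wget has this built in via `--wait` and `--random-wait`. The same idea in a plain shell loop, with `echo` standing in for the real fetch and the URL list invented:

```shell
# Made-up URL list, one target per line:
printf 'http://example.org/a\nhttp://example.org/b\n' > urls.txt

fetched=0
while read -r url; do
  echo "fetching $url"          # real script would curl/wget here
  fetched=$((fetched + 1))
  # sleep 5                     # uncomment to pause between requests
done < urls.txt
echo "done: $fetched fetched"
```

The equivalent with wget alone would be something like `wget --wait=5 --random-wait -i urls.txt`, which also jitters the delay so the crawl looks less mechanical.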
most of the .json files that are coming back correctly are literally only 2 bytes long
[13:37] perhaps that means no archives
[13:37] let's see
[13:38] yes, seems so
[13:38] ok, that cuts down the list still further
[13:40] 5 seconds between requests seems to work
[13:40] crap, spoke too soon
[13:41] doesn't look like a hard ban, though - works if you try again after a bit
[13:43] antomatic: yes, 2 byte json seems to mean that there's nothing there
[13:44] night
[14:09] 2 byte jason
[14:10] hurr hurr
[17:06] I don't remember, do we ever cooperate with http://flossmole.org/ ?
[22:57] justin.tv deleting "everything" in one week, twitch doing the same (but not confirmed) https://help.justin.tv/entries/41803380-Changes-to-Video-Archive-System
[22:57] (original tweet: https://twitter.com/0xabad1dea/status/473235121899057152)
[23:33] twitch is not deleting all the archives
[23:34] https://news.ycombinator.com/item?id=7827643
[23:34] that guy worked at jtv for 4 years
[23:37] btw twitch archive videos get views, at least for popular streams. e.g. you can see view counts here: http://www.twitch.tv/cosmowright/profile/pastBroadcasts
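The 2-byte responses discussed earlier in the log are presumably just an empty JSON array (`[]`), so channels with no archives can be filtered on file size alone. A sketch using find's `-size` test, where `+2c` means strictly more than 2 bytes (the filenames are invented):

```shell
# Made-up results directory: one empty response, one real one.
mkdir -p archives_json
printf '[]' > archives_json/noarchives.json        # exactly 2 bytes
printf '[{"id":1}]' > archives_json/hasvideos.json # real payload

# Keep only responses bigger than the empty 2-byte marker:
kept=$(find archives_json -name '*.json' -size +2c)
echo "$kept"
```

Piping that `find` output into the downloader is what "cuts down the list still further", as the log puts it.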