#archiveteam 2014-06-01,Sun

Time Nickname Message
01:22 🔗 exmic I feel like putting huge sites into archivebot, jamming it up for one-off archivals, is making it useless for small emergency grabs
01:22 🔗 exmic does anyone else agree?
01:44 🔗 DFJustin there needs to be an express lane or something for it
01:44 🔗 exmic sure I guess
01:45 🔗 DFJustin otherwise it will always expand to fill capacity and starve new jobs
01:45 🔗 exmic right
01:45 🔗 exmic is what is happening right this moment
01:45 🔗 exmic I have half a mind to start killing huge ancient jobs
01:45 🔗 exmic but I won't
01:47 🔗 DFJustin I would say page the people who started some of the stuff and ask if it's still getting anything of value or likely to finish
01:48 🔗 DFJustin I think there are some fire and forget cases in there
01:48 🔗 DFJustin but there are things like maben or preterhuman where there is a well defined and valuable set of stuff that just takes a while to get through
01:53 🔗 DFJustin stuff like asiair or mturk it's not clear to me what it's getting or whether it will ever finish
01:54 🔗 DFJustin pixiv I guarantee will not finish because there are 40 million plus images with multiple urls for each
01:55 🔗 * exmic nods
01:55 🔗 DFJustin and I haven't actually seen it retrieve a .jpg yet
03:04 🔗 ivan` http://www.motherjones.com/media/2014/05/internet-archive-wayback-machine-brewster-kahle
03:05 🔗 ivan` missed that
05:32 🔗 voltagex I can't spend any cash but I'd like to help with the justin.tv archiving
05:36 🔗 ivan` join #justouttv
05:41 🔗 voltagex ah, the thread I'm reading says that was still being set up
05:45 🔗 voltagex it seems dead-ish.
07:13 🔗 voltagex go curl, go!
07:13 🔗 exmic curly curl
07:14 🔗 voltagex so exmic, in those result pages, grab the span with class "small black" and you have the user IDs
07:14 🔗 exmic so you do
07:15 🔗 exmic looking in the page src for a numeric user id, not finding what I want
07:15 🔗 voltagex plug the user ID into http://api.justin.tv/api/channel/archives/officecam.json?limit=50 (use limit and offset parameters)
07:15 🔗 voltagex then ???
07:15 🔗 voltagex then we win
07:15 🔗 voltagex no, you won't get numeric IDs
07:15 🔗 exmic what dinguses
07:15 🔗 exmic ooh, an api
07:15 🔗 exmic I like apis
07:15 🔗 voltagex in the JSON I linked, there's userID parameters
07:16 🔗 exmic and direct links to flvs
07:16 🔗 exmic allegedly
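
[Editor's sketch] A minimal way to page through the archives endpoint voltagex links above, using the `limit` and `offset` parameters he mentions. The function name `archive_urls` and the page count are hypothetical, and it is an assumption that `offset` counts items rather than pages. The script only prints URLs; it makes no claim about what the API returns.

```shell
#!/bin/sh
# Print one archives URL per page for a channel.
# Usage: archive_urls CHANNEL PAGES [LIMIT]
# Assumption: offset is counted in items, so page k starts at k * LIMIT.
archive_urls() {
    channel=$1
    pages=$2
    limit=${3:-50}
    page=0
    while [ "$page" -lt "$pages" ]; do
        offset=$((page * limit))
        echo "http://api.justin.tv/api/channel/archives/$channel.json?limit=$limit&offset=$offset"
        page=$((page + 1))
    done
}

archive_urls officecam 3
```

Each printed line can be handed to curl or wget, one request at a time.
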
07:16 🔗 voltagex this API is particularly fucked.
07:16 🔗 exmic how so?
07:16 🔗 exmic this is like taking candy from a baby
07:16 🔗 voltagex oh, okay then.
07:16 🔗 exmic I mean, what's fucked about it from your perspective?
07:16 🔗 voltagex just having a mental block when you said you needed numeric IDs
07:16 🔗 exmic oh
07:16 🔗 voltagex do you want the listing HTML I grabbed?
07:17 🔗 exmic I like numeric ids because you can use a shell 'for' loop to iterate them :P
07:17 🔗 voltagex question, do you know how to convert to all lowercase in shell?
07:17 🔗 exmic | tr 'A-Z' 'a-z'
07:17 🔗 voltagex oh ffs, that program does everything
07:17 🔗 exmic unix!
07:18 🔗 voltagex nope, uniq -u fucked me over. I'm trying to get rid of the duplicates in that pastebin I linked
07:18 🔗 voltagex got it, no -u
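
[Editor's sketch] The two steps discussed here, lowercasing and dropping duplicates, fit in one pipeline. `uniq -u` prints only the lines that occur exactly once, and plain `uniq` only collapses adjacent duplicates, so sorting first (or using `sort -u`) is the usual approach. The function name `normalize_names` and the sample input are hypothetical.

```shell
#!/bin/sh
# Lowercase every line on stdin, then sort and drop duplicate lines.
normalize_names() {
    tr 'A-Z' 'a-z' | sort -u
}

# Sample input: two spellings of one name plus a second name.
printf 'OfficeCam\nofficecam\nOtherUser\n' | normalize_names
```

The same function can read a file, for example `normalize_names < names.txt`, where `names.txt` is a placeholder.
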
07:19 🔗 exmic this one http://pastebin.com/cK1P3dhw ?
07:19 🔗 voltagex yeah
07:19 🔗 exmic ah, I see, duplicates.
07:19 🔗 voltagex I need to go from that listing to a list of curlable URLs. Hm.
07:21 🔗 exmic aye
07:21 🔗 exmic I'm full of coffee but the coffeeshop is closing
07:22 🔗 exmic :P
07:22 🔗 exmic going home, back on in 30-50 minutes
07:22 🔗 voltagex okay, I'm available for
07:22 🔗 voltagex 3ish hours
07:23 🔗 exmic probly sooner but leaving myself time
07:27 🔗 closure while read l end; do for n in $(seq 1 $end); do echo "curl 'http://www.justin.tv/search?q=$l&only=archives&sort-by=count&only=users&page=$n' -o $l$n.html"; done; done
07:27 🔗 closure you'll need to feed it a file containing "a 1000" "b 1500" etc lines
07:27 🔗 voltagex will do, back soon
07:40 🔗 voltagex hey closure, the $1 for the filename doesn't work
07:41 🔗 closure should be a $l not $1
07:41 🔗 closure ie, L
07:41 🔗 closure while read L END; do for N in $(seq 1 $END); do echo "curl 'http://www.justin.tv/search?q=$l&only=archives&sort-by=count&only=users&page=$n' -o $L$N.html"; done; done
07:41 🔗 voltagex yes sorry, it is, still don't work. hold on
07:42 🔗 closure you might need to use bash
07:42 🔗 voltagex I am using bash
07:42 🔗 closure while read L END; do for N in $(seq 1 $END); do echo "curl 'http://www.justin.tv/search?q=$L&only=archives&sort-by=count&only=users&page=$N' -o $L$N.html"; done; done
07:43 🔗 closure works for me, eg if I echo a 10 | to it
07:43 🔗 voltagex I've reformatted for that script at http://pastebin.com/cK1P3dhw
07:44 🔗 voltagex I only get 1.html 2.html etc, which isn't unique
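
[Editor's sketch] closure's final one-liner (07:42) laid out as a small function, assuming the input format closure describes: one search term and one page count per line, such as "a 1000". The function name `gen_search_curls` is hypothetical. Like the original, it only prints curl commands and does not run them; the one addition is that blank or incomplete input lines are skipped.

```shell
#!/bin/sh
# Read "TERM PAGES" lines on stdin and print one curl command per results page.
gen_search_curls() {
    while read -r term pages; do
        # Skip blank or incomplete lines.
        [ -n "$term" ] && [ -n "$pages" ] || continue
        for n in $(seq 1 "$pages"); do
            echo "curl 'http://www.justin.tv/search?q=$term&only=archives&sort-by=count&only=users&page=$n' -o '$term$n.html'"
        done
    done
}

echo "a 2" | gen_search_curls
```

Piping the printed commands into `sh` would run them; checking a few lines by eye first is cheap.
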
07:50 🔗 voltagex I'm assuming you're working on this. I'm going to grab some food
07:53 🔗 * exmic slides in
07:55 🔗 godane so i have over 226k uniq urls for nbcnews.com i think
07:56 🔗 godane i work on like 5 projects at a time
08:12 🔗 voltagex erm
08:12 🔗 voltagex the list I generated is too big for pastebin
08:14 🔗 exmic heh, makes sense
08:15 🔗 voltagex this is dumb
08:15 🔗 voltagex does someone have a place I can dump 2.5mb of text or a smaller gzip?
08:16 🔗 exmic mail it to me and I can put it on the net somewhere
08:16 🔗 exmic duncan@xrtc.net
08:17 🔗 ZoeB The owner of this one man company's just died. Any chance of someone setting the archive bot on it? http://www.hollowsun.com
08:17 🔗 exmic sure! :D
08:18 🔗 ZoeB Thankyou! ^.^
08:18 🔗 exmic sad to hear the person died though
08:19 🔗 voltagex sent, exmic
08:19 🔗 ZoeB Mhmm. :/
08:19 🔗 exmic rxcvd
08:20 🔗 ZoeB It looks like this is his too: http://www.novachord.co.uk
08:20 🔗 exmic stuffed that in the queue too
08:21 🔗 ZoeB Thank you!
08:21 🔗 exmic my pleasure
08:21 🔗 ZoeB You're the best. :)
08:21 🔗 ZoeB OK, time for breakfast, see you.
08:22 🔗 exmic cheerio
08:26 🔗 exmic voltagex: it should be at http://bl-r.com/trx/justin-tv-search-urls.txt.gz
08:29 🔗 voltagex exmic: thanks. Not sure if it's useful for people.
08:29 🔗 * exmic shrugs
08:29 🔗 exmic c'est la vie
08:29 🔗 voltagex I really need someone with a faster connection to pull it down
08:29 🔗 exmic a few thousand text pages shouldn't take that long on any reasonable connection
08:30 🔗 exmic how many bits/sec do you have?
08:30 🔗 voltagex there's reasonable, then there's Australian
08:33 🔗 exmic I mean, yeah
08:34 🔗 exmic wallaby stole your guys internet
08:36 🔗 voltagex lol
08:36 🔗 voltagex downloading it with a canadian VPS now.
08:36 🔗 voltagex >> I think justin.tv returns 200 when it means 500
08:36 🔗 voltagex ffs
08:37 🔗 exmic fucking web coders
08:37 🔗 voltagex hey hey hey I am a web coder :P
08:43 🔗 voltagex 4 hours -_-
08:43 🔗 exmic that's not too bad
08:43 🔗 exmic :]
08:44 🔗 voltagex yes, but it is past the time I will be asleep and I won't be able to hand the results off yet
08:44 🔗 exmic mmm
08:44 🔗 exmic you should go to sleep and clean it up quick in the morning then
08:44 🔗 voltagex could just give you a key to this VPS :P
08:44 🔗 exmic I'm about to hit the sack myself
08:44 🔗 exmic being near 0200
08:44 🔗 voltagex ah, righto
08:44 🔗 voltagex I'll see how far it gets in the next two hours
08:45 🔗 voltagex heh, thanks for reminding me to fix the clock drift on this box :P
08:56 🔗 godane so i found a abc special called unbroke
08:57 🔗 godane all i know is its from 2010-02-24 and will smith is in it
08:57 🔗 godane anyways to #-bs
08:58 🔗 voltagex 310mb of HTML downloaded, 100ish failures
09:26 🔗 godane so that abc special is really from may 29 2009
09:27 🔗 godane the best link about the special sadly: http://muppet.wikia.com/wiki/Un-Broke:_What_You_Need_to_Know_About_Money
09:43 🔗 voltagex some stuff I'm writing to keep track of what I'm doing
09:43 🔗 voltagex https://gist.github.com/voltagex/6067ee19df87dac7072c
11:01 🔗 voltagex hey exmic you there?
11:01 🔗 voltagex also closure
11:35 🔗 closure yesh
11:36 🔗 voltagex closure: download of all search results pages successful :D
11:36 🔗 voltagex closure: except it's on the world's worst VPS and I can barely tar it up
11:38 🔗 closure and here I thought I had the world's worst VPS, after struggling all night with its horrible horrible grub
11:38 🔗 closure otoh, it's all working now
11:40 🔗 voltagex wait, you have to deal with bootloaders?
11:41 🔗 closure well, when I'm setting up a server with half a terabyte of storage, I want to make sure I can get a grub menu if it breaks later
11:44 🔗 voltagex closure: if I gave you 2GB of Justin.TV search results HTML, would it be useful?
11:56 🔗 voltagex choo choo
12:15 🔗 midas so far i have about 1TB of video downloaded
12:20 🔗 voltagex sorry midas, didn't mean to tread on your toes
12:20 🔗 voltagex I'm working on a list of all usernames
12:20 🔗 voltagex >> I'm hoping this wasn't pointless
12:22 🔗 midas no no! please, continue
12:23 🔗 midas because im only grabbing the video's right now
12:24 🔗 voltagex 21813 usernames_unique.txt
12:24 🔗 voltagex if that's correct, 21.8k users with archived videos
12:36 🔗 voltagex FUCL
12:36 🔗 voltagex FUCK*
12:36 🔗 voltagex empty json
12:43 🔗 voltagex who wanted numeric user IDs?
12:48 🔗 voltagex anyone awake? I need a hand
12:51 🔗 voltagex ping midas closure exmic godane
12:55 🔗 schbirid on irc, ask your question / raise your point
12:55 🔗 schbirid dont wait for someone to say "ok, now i am listening"
12:55 🔗 schbirid irc is asynchronous and awesome at that
12:55 🔗 voltagex schbirid: I know, I just don't want the work I've done to go to waste
12:56 🔗 schbirid oh i thought "who wanted numeric user IDs" was a rant at json :D
12:56 🔗 voltagex I'm looking for someone to coordinate more video downloads. I've got to a point where I have a list of channels to grab (with video URLs inside JSON) but I think I've been blacklisted by justin.tv
12:56 🔗 voltagex I let wget run a bit too fast :D
12:56 🔗 schbirid there is a dedicated channel now
12:56 🔗 voltagex schbirid: https://gist.github.com/voltagex/6067ee19df87dac7072c is where I'm up to
12:56 🔗 schbirid #justouttv
12:56 🔗 voltagex yes, I'm there too
12:57 🔗 voltagex but the people helping me earlier were mainly in this channel. Anywho. I have to go very soon
12:57 🔗 schbirid nice :)
13:32 🔗 g_lined "When you're using the Justintv API like you are, you can get a higher rate limit by sending oauth authenticated requests instead of just normal ones" https://groups.google.com/forum/#!msg/twitch-api/8YHDdNran_A/4cv3l8Adf58J
13:36 🔗 antomatic bleh! most of the .json files that are coming back correctly are literally only 2 bytes long
13:37 🔗 antomatic perhaps that means no archives
13:37 🔗 antomatic let's see
13:38 🔗 antomatic yes, seems so
13:38 🔗 antomatic ok, that cuts down the list still further
13:40 🔗 antomatic 5 seconds between requests seems to work
13:40 🔗 antomatic crap, spoke too soon
13:41 🔗 antomatic doesn't look like a hard ban, though - works if you try again after a bit
13:43 🔗 voltagex antomatic: yes, 2 byte json seems to mean that there's nothing there
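
[Editor's sketch] One way to apply the rule antomatic and voltagex arrive at: treat a 2-byte `.json` response as "no archives" and list only the larger files. The 2-byte rule is as reported in the channel, not documented API behavior, and the function name `channels_with_archives` and the directory layout are hypothetical.

```shell
#!/bin/sh
# Print the .json files in a directory that are larger than 2 bytes.
channels_with_archives() {
    dir=$1
    for f in "$dir"/*.json; do
        # An unmatched glob stays literal; skip anything that is not a file.
        [ -f "$f" ] || continue
        size=$(wc -c < "$f" | tr -d ' ')
        if [ "$size" -gt 2 ]; then
            echo "$f"
        fi
    done
}

# Default to the current directory when no argument is given.
channels_with_archives "${1:-.}"
```
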
13:44 🔗 voltagex night
14:09 🔗 ersi 2 byte jason
14:10 🔗 ersi hurr hurr
17:06 🔗 Nemo_bis I don't remember, do we ever cooperate with http://flossmole.org/ ?
22:57 🔗 lemonkey justin.tv deleting "everything" in one week, twitch doing the same (but not confirmed) https://help.justin.tv/entries/41803380-Changes-to-Video-Archive-System
22:57 🔗 lemonkey (original tweet: https://twitter.com/0xabad1dea/status/473235121899057152)
23:33 🔗 curi twitch is not deleting all the archives
23:34 🔗 curi https://news.ycombinator.com/item?id=7827643
23:34 🔗 curi that guy worked at jtv for 4 years
23:37 🔗 curi btw twitch archive videos get views, at least for popular streams. e.g. you can see view counts here: http://www.twitch.tv/cosmowright/profile/pastBroadcasts
