Time |
Nickname |
Message |
01:22
🔗
|
exmic |
I feel like putting huge sites into archivebot, jamming it up for one-off archivals, is making it useless for small emergency grabs |
01:22
🔗
|
exmic |
does anyone else agree? |
01:44
🔗
|
DFJustin |
there needs to be an express lane or something for it |
01:44
🔗
|
exmic |
sure I guess |
01:45
🔗
|
DFJustin |
otherwise it will always expand to fill capacity and starve new jobs |
01:45
🔗
|
exmic |
right |
01:45
🔗
|
exmic |
is what is happening right this moment |
01:45
🔗
|
exmic |
I have half a mind to start killing huge ancient jobs |
01:45
🔗
|
exmic |
but I won't |
01:47
🔗
|
DFJustin |
I would say page the people who started some of the stuff and ask if it's still getting anything of value or likely to finish |
01:48
🔗
|
DFJustin |
I think there are some fire and forget cases in there |
01:48
🔗
|
DFJustin |
but there are things like maben or preterhuman where there is a well defined and valuable set of stuff that just takes a while to get through |
01:53
🔗
|
DFJustin |
stuff like asiair or mturk it's not clear to me what it's getting or whether it will ever finish |
01:54
🔗
|
DFJustin |
pixiv I guarantee will not finish because there are 40 million plus images with multiple urls for each |
01:55
🔗
|
* |
exmic nods |
01:55
🔗
|
DFJustin |
and I haven't actually seen it retrieve a .jpg yet |
03:04
🔗
|
ivan` |
http://www.motherjones.com/media/2014/05/internet-archive-wayback-machine-brewster-kahle |
03:05
🔗
|
ivan` |
missed that |
05:32
🔗
|
voltagex |
I can't spend any cash but I'd like to help with the justin.tv archiving |
05:36
🔗
|
ivan` |
join #justouttv |
05:41
🔗
|
voltagex |
ah, the thread I'm reading says that was still being set up |
05:45
🔗
|
voltagex |
it seems dead-ish. |
07:13
🔗
|
voltagex |
go curl, go! |
07:13
🔗
|
exmic |
curly curl |
07:14
🔗
|
voltagex |
so exmic, in those result pages, grab the span with class "small black" and you have the user IDs |
07:14
🔗
|
exmic |
so you do |
07:15
🔗
|
exmic |
looking in the page src for a numeric user id, not finding what I want |
07:15
🔗
|
voltagex |
plug the user ID into http://api.justin.tv/api/channel/archives/officecam.json?limit=50 (use limit and offset parameters) |
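Editor's sketch of the limit/offset paging mentioned above. The URL shape is copied from the message; the `archive_urls` helper, the page count, and the assumption that `offset` advances in steps of `limit` are illustrative and not confirmed by the log.

```shell
#!/usr/bin/env bash
# Print archive-listing URLs for one channel, one URL per page.
# Assumption: offset is a zero-based item offset that advances by `limit`.
archive_urls() {
  local channel=$1 pages=$2 limit=${3:-50} i
  for ((i = 0; i < pages; i++)); do
    printf 'http://api.justin.tv/api/channel/archives/%s.json?limit=%d&offset=%d\n' \
      "$channel" "$limit" "$((i * limit))"
  done
}

# Example: the first three pages for the channel named in the log.
archive_urls officecam 3
```

The function only prints URLs, so the list can be reviewed or throttled before anything is fetched.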
07:15
🔗
|
voltagex |
then ??? |
07:15
🔗
|
voltagex |
then we win |
07:15
🔗
|
voltagex |
no, you won't get numeric IDs |
07:15
🔗
|
exmic |
what dinguses |
07:15
🔗
|
exmic |
ooh, an api |
07:15
🔗
|
exmic |
I like apis |
07:15
🔗
|
voltagex |
in the JSON I linked, there's userID parameters |
07:16
🔗
|
exmic |
and direct links to flvs |
07:16
🔗
|
exmic |
allegedly |
07:16
🔗
|
voltagex |
this API is particularly fucked. |
07:16
🔗
|
exmic |
how so? |
07:16
🔗
|
exmic |
this is like taking candy from a baby |
07:16
🔗
|
voltagex |
oh, okay then. |
07:16
🔗
|
exmic |
I mean, what's fucked about it from your perspective? |
07:16
🔗
|
voltagex |
just having a mental block when you said you needed numeric IDs |
07:16
🔗
|
exmic |
oh |
07:16
🔗
|
voltagex |
do you want the listing HTML I grabbed? |
07:17
🔗
|
exmic |
I like numeric ids because you can use a shell 'for' loop to iterate them :P |
07:17
🔗
|
voltagex |
question, do you know how to convert to all lowercase in shell? |
07:17
🔗
|
exmic |
| tr 'A-Z' 'a-z' |
07:17
🔗
|
voltagex |
oh ffs, that program does everything |
07:17
🔗
|
exmic |
unix! |
07:18
🔗
|
voltagex |
nope, uniq -u fucked me over. I'm trying to get rid of the duplicates in that pastebin I linked |
07:18
🔗
|
voltagex |
got it, no -u |
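Editor's sketch of the lowercase-and-deduplicate step discussed above. `uniq -u` prints only lines that are never repeated, so it drops duplicated names entirely instead of keeping one copy. The `normalize_usernames` helper and the sample names are made up for illustration.

```shell
#!/usr/bin/env bash
# Lowercase every line, then keep exactly one copy of each distinct line.
# `sort -u` sorts and deduplicates in one step; plain `uniq` would also work,
# but only on sorted input, because it collapses adjacent duplicates only.
normalize_usernames() {
  tr 'A-Z' 'a-z' | sort -u
}

# Example with invented names (not taken from the real listing):
printf 'OfficeCam\nofficecam\nExampleUser\n' | normalize_usernames
```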
07:19
🔗
|
exmic |
this one http://pastebin.com/cK1P3dhw ? |
07:19
🔗
|
voltagex |
yeah |
07:19
🔗
|
exmic |
ah, I see, duplicates. |
07:19
🔗
|
voltagex |
I need to go from that listing to a list of curlable URLs. Hm. |
07:21
🔗
|
exmic |
aye |
07:21
🔗
|
exmic |
I'm full of coffee but the coffeeshop is closing |
07:22
🔗
|
exmic |
:P |
07:22
🔗
|
exmic |
going home, back on in 30-50 minutes |
07:22
🔗
|
voltagex |
okay, I'm available for |
07:22
🔗
|
voltagex |
3ish hours |
07:23
🔗
|
exmic |
probly sooner but leaving myself time |
07:27
🔗
|
closure |
while read l end; do for n in $(seq 1 $end); do echo "curl 'http://www.justin.tv/search?q=$l&only=archives&sort-by=count&only=users&page=$n' -o $l$n.html"; done; done |
07:27
🔗
|
closure |
you'll need to feed it a file containing "a 1000" "b 1500" etc lines |
07:27
🔗
|
voltagex |
will do, back soon |
07:40
🔗
|
voltagex |
hey closure, the $1 for the filename doesn't work |
07:41
🔗
|
closure |
should be a $l not $1 |
07:41
🔗
|
closure |
ie, L |
07:41
🔗
|
closure |
while read L END; do for N in $(seq 1 $END); do echo "curl 'http://www.justin.tv/search?q=$l&only=archives&sort-by=count&only=users&page=$n' -o $L$N.html"; done; done |
07:41
🔗
|
voltagex |
yes sorry, it is, still don't work. hold on |
07:42
🔗
|
closure |
you might need to use bash |
07:42
🔗
|
voltagex |
I am using bash |
07:42
🔗
|
closure |
while read L END; do for N in $(seq 1 $END); do echo "curl 'http://www.justin.tv/search?q=$L&only=archives&sort-by=count&only=users&page=$N' -o $L$N.html"; done; done |
07:43
🔗
|
closure |
works for me, eg if I echo a 10 | to it |
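Editor's sketch of the same loop as a reusable function, assuming input lines of the form "QUERY LAST_PAGE" as closure describes. The function name `gen_search_cmds` and the sample input are hypothetical; the real per-letter page counts are not given in the log.

```shell
#!/usr/bin/env bash
# Read "QUERY LAST_PAGE" pairs on stdin and print one curl command per page.
# The commands are printed, not executed, so they can be inspected first.
gen_search_cmds() {
  local query last page
  while read -r query last; do
    for page in $(seq 1 "$last"); do
      printf "curl 'http://www.justin.tv/search?q=%s&only=archives&sort-by=count&only=users&page=%s' -o %s%s.html\n" \
        "$query" "$page" "$query" "$page"
    done
  done
}

# Hypothetical input: two pages for "a", one page for "b".
printf 'a 2\nb 1\n' | gen_search_cmds
```

Each output file name combines the query and the page number, so the input needs both columns in that order for the names to be unique.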
07:43
🔗
|
voltagex |
I've reformatted for that script at http://pastebin.com/cK1P3dhw |
07:44
🔗
|
voltagex |
I only get 1.html 2.html etc, which isn't unique |
07:50
🔗
|
voltagex |
I'm assuming you're working on this. I'm going to grab some food |
07:53
🔗
|
* |
exmic slides in |
07:55
🔗
|
godane |
so i have over 226k uniq urls for nbcnews.com i think |
07:56
🔗
|
godane |
i work on like 5 projects at a time |
08:12
🔗
|
voltagex |
erm |
08:12
🔗
|
voltagex |
the list I generated is too big for pastebin |
08:14
🔗
|
exmic |
heh, makes sense |
08:15
🔗
|
voltagex |
this is dumb |
08:15
🔗
|
voltagex |
does someone have a place I can dump 2.5mb of text or a smaller gzip? |
08:16
🔗
|
exmic |
mail it to me and I can put it on the net somewhere |
08:16
🔗
|
exmic |
duncan@xrtc.net |
08:17
🔗
|
ZoeB |
The owner of this one man company's just died. Any chance of someone setting the archive bot on it? http://www.hollowsun.com |
08:17
🔗
|
exmic |
sure! :D |
08:18
🔗
|
ZoeB |
Thank you! ^.^ |
08:18
🔗
|
exmic |
sad to hear the person died though |
08:19
🔗
|
voltagex |
sent, exmic |
08:19
🔗
|
ZoeB |
Mhmm. :/ |
08:19
🔗
|
exmic |
rxcvd |
08:20
🔗
|
ZoeB |
It looks like this is his too: http://www.novachord.co.uk |
08:20
🔗
|
exmic |
stuffed that in the queue too |
08:21
🔗
|
ZoeB |
Thank you! |
08:21
🔗
|
exmic |
my pleasure |
08:21
🔗
|
ZoeB |
You're the best. :) |
08:21
🔗
|
ZoeB |
OK, time for breakfast, see you. |
08:22
🔗
|
exmic |
cheerio |
08:26
🔗
|
exmic |
voltagex: it should be at http://bl-r.com/trx/justin-tv-search-urls.txt.gz |
08:29
🔗
|
voltagex |
exmic: thanks. Not sure if it's useful for people. |
08:29
🔗
|
* |
exmic shrugs |
08:29
🔗
|
exmic |
c'est la vie |
08:29
🔗
|
voltagex |
I really need someone with a faster connection to pull it down |
08:29
🔗
|
exmic |
a few thousand text pages shouldn't take that long on any reasonable connection |
08:30
🔗
|
exmic |
how many bits/sec do you have? |
08:30
🔗
|
voltagex |
there's reasonable, then there's Australian |
08:33
🔗
|
exmic |
I mean, yeah |
08:34
🔗
|
exmic |
wallaby stole your guys internet |
08:36
🔗
|
voltagex |
lol |
08:36
🔗
|
voltagex |
downloading it with a canadian VPS now. |
08:36
🔗
|
voltagex |
>> I think justin.tv returns 200 when it means 500 |
08:36
🔗
|
voltagex |
ffs |
08:37
🔗
|
exmic |
fucking web coders |
08:37
🔗
|
voltagex |
hey hey hey I am a web coder :P |
08:43
🔗
|
voltagex |
4 hours -_- |
08:43
🔗
|
exmic |
that's not too bad |
08:43
🔗
|
exmic |
:] |
08:44
🔗
|
voltagex |
yes, but it is past the time I will be asleep and I won't be able to hand the results off yet |
08:44
🔗
|
exmic |
mmm |
08:44
🔗
|
exmic |
you should go to sleep and clean it up quick in the morning then |
08:44
🔗
|
voltagex |
could just give you a key to this VPS :P |
08:44
🔗
|
exmic |
I'm about to hit the sack myself |
08:44
🔗
|
exmic |
being near 0200 |
08:44
🔗
|
voltagex |
ah, righto |
08:44
🔗
|
voltagex |
I'll see how far it gets in the next two hours |
08:45
🔗
|
voltagex |
heh, thanks for reminding me to fix the clock drift on this box :P |
08:56
🔗
|
godane |
so i found a abc special called unbroke |
08:57
🔗
|
godane |
all i know is its from 2010-02-24 and will smith is in it |
08:57
🔗
|
godane |
anyways to #-bs |
08:58
🔗
|
voltagex |
310mb of HTML downloaded, 100ish failures |
09:26
🔗
|
godane |
so that abc special is really from may 29 2009 |
09:27
🔗
|
godane |
the best link about the special sadly: http://muppet.wikia.com/wiki/Un-Broke:_What_You_Need_to_Know_About_Money |
09:43
🔗
|
voltagex |
some stuff I'm writing to keep track of what I'm doing |
09:43
🔗
|
voltagex |
https://gist.github.com/voltagex/6067ee19df87dac7072c |
11:01
🔗
|
voltagex |
hey exmic you there? |
11:01
🔗
|
voltagex |
also closure |
11:35
🔗
|
closure |
yesh |
11:36
🔗
|
voltagex |
closure: download of all search results pages successful :D |
11:36
🔗
|
voltagex |
closure: except it's on the world's worst VPS and I can barely tar it up |
11:38
🔗
|
closure |
and here I thought I had the world's worst VPS, after struggling all night with its horrible horrible grub |
11:38
🔗
|
closure |
otoh, it's all working now |
11:40
🔗
|
voltagex |
wait, you have to deal with bootloaders? |
11:41
🔗
|
closure |
well, when I'm setting up a server with half a terabyte of storage, I want to make sure I can get a grub menu if it breaks later |
11:44
🔗
|
voltagex |
closure: if I gave you 2GB of Justin.TV search results HTML, would it be useful? |
11:56
🔗
|
voltagex |
choo choo |
12:15
🔗
|
midas |
so far i have about 1TB of video downloaded |
12:20
🔗
|
voltagex |
sorry midas, didn't mean to tread on your toes |
12:20
🔗
|
voltagex |
I'm working on a list of all usernames |
12:20
🔗
|
voltagex |
>> I'm hoping this wasn't pointless |
12:22
🔗
|
midas |
no no! please, continue |
12:23
🔗
|
midas |
because im only grabbing the videos right now |
12:24
🔗
|
voltagex |
21813 usernames_unique.txt |
12:24
🔗
|
voltagex |
if that's correct, 21.8k users with archived videos |
12:36
🔗
|
voltagex |
FUCL |
12:36
🔗
|
voltagex |
FUCK* |
12:36
🔗
|
voltagex |
empty json |
12:43
🔗
|
voltagex |
who wanted numeric user IDs? |
12:48
🔗
|
voltagex |
anyone awake? I need a hand |
12:51
🔗
|
voltagex |
ping midas closure exmic godane |
12:55
🔗
|
schbirid |
on irc, ask your question / raise your point |
12:55
🔗
|
schbirid |
dont wait for someone to say "ok, now i am listening" |
12:55
🔗
|
schbirid |
irc is asynchronous and awesome at that |
12:55
🔗
|
voltagex |
schbirid: I know, I just don't want the work I've done to go to waste |
12:56
🔗
|
schbirid |
oh i thought "who wanted numeric user IDs" was a rant at json :D |
12:56
🔗
|
voltagex |
I'm looking for someone to coordinate more video downloads. I've got to a point where I have a list of channels to grab (with video URLs inside JSON) but I think I've been blacklisted by justin.tv |
12:56
🔗
|
voltagex |
I let wget run a bit too fast :D |
12:56
🔗
|
schbirid |
there is a dedicated channel now |
12:56
🔗
|
voltagex |
schbirid: https://gist.github.com/voltagex/6067ee19df87dac7072c is where I'm up to |
12:56
🔗
|
schbirid |
#justouttv |
12:56
🔗
|
voltagex |
yes, I'm there too |
12:57
🔗
|
voltagex |
but the people helping me earlier were mainly in this channel. Anywho. I have to go very soon |
12:57
🔗
|
schbirid |
nice :) |
13:32
🔗
|
g_lined |
"When you're using the Justintv API like you are, you can get a higher rate limit by sending oauth authenticated requests instead of just normal ones" https://groups.google.com/forum/#!msg/twitch-api/8YHDdNran_A/4cv3l8Adf58J |
13:36
🔗
|
antomatic |
bleh! most of the .json files that are coming back correctly are literally only 2 bytes long |
13:37
🔗
|
antomatic |
perhaps that means no archives |
13:37
🔗
|
antomatic |
let's see |
13:38
🔗
|
antomatic |
yes, seems so |
13:38
🔗
|
antomatic |
ok, that cuts down the list still further |
13:40
🔗
|
antomatic |
5 seconds between requests seems to work |
13:40
🔗
|
antomatic |
crap, spoke too soon |
13:41
🔗
|
antomatic |
doesn't look like a hard ban, though - works if you try again after a bit |
13:43
🔗
|
voltagex |
antomatic: yes, 2 byte json seems to mean that there's nothing there |
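Editor's sketch combining the two observations above: pause between requests, and treat a 2-byte response as "no archives". The helper names, the 5-second default, and the retry count are illustrative assumptions, not known-safe limits for the site.

```shell
#!/usr/bin/env bash
# True when a saved response is 2 bytes or smaller (for example "[]"),
# which the log reports as meaning the channel has no archives.
is_empty_archive() {
  local size
  size=$(wc -c < "$1")
  [ "$((size))" -le 2 ]
}

# Fetch one URL to a file, pausing after success and backing off on failure.
# Caveat from the log: the site may answer 200 for an error page, and
# `curl --fail` cannot detect that case.
polite_fetch() {
  local url=$1 out=$2 delay=${DELAY:-5} attempt
  for attempt in 1 2 3; do
    if curl --silent --fail --output "$out" "$url"; then
      sleep "$delay"
      return 0
    fi
    sleep "$((delay * attempt * 2))"
  done
  return 1
}
```

A caller could run `polite_fetch "$url" "$name.json"` for each channel and then skip files for which `is_empty_archive` succeeds.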
13:44
🔗
|
voltagex |
night |
14:09
🔗
|
ersi |
2 byte jason |
14:10
🔗
|
ersi |
hurr hurr |
17:06
🔗
|
Nemo_bis |
I don't remember, do we ever cooperate with http://flossmole.org/ ? |
22:57
🔗
|
lemonkey |
justin.tv deleting "everything" in one week, twitch doing the same (but not confirmed) https://help.justin.tv/entries/41803380-Changes-to-Video-Archive-System |
22:57
🔗
|
lemonkey |
(original tweet: https://twitter.com/0xabad1dea/status/473235121899057152) |
23:33
🔗
|
curi |
twitch is not deleting all the archives |
23:34
🔗
|
curi |
https://news.ycombinator.com/item?id=7827643 |
23:34
🔗
|
curi |
that guy worked at jtv for 4 years |
23:37
🔗
|
curi |
btw twitch archive videos get views, at least for popular streams. e.g. you can see view counts here: http://www.twitch.tv/cosmowright/profile/pastBroadcasts |