11:03 <arkiver> anyone else getting error 500 with S3 all the time?
11:20 <Nemo_bis> arkiver: have you checked that https://archive.org/catalog.php?history=1&justme=1 looks ok for you?
11:20 <arkiver> Nemo_bis: 101 waiting to run...
11:21 <arkiver> is that why S3 is acting weird?
11:21 <Nemo_bis> I've had s3 reject my uploads when I had too many in the queue sometimes
11:21 <Nemo_bis> that may be the reason or not
11:22 <arkiver> ah
11:22 <arkiver> maybe it is
11:22 <arkiver> https://catalogd.archive.org/log/316904585
11:22 <arkiver> I got 6 of these errors
11:22 <arkiver> rerunning one and it looks like it's working, so I'm going to rerun them
11:27 <arkiver> Nemo_bis: looks like it's working again. Thank you!
11:36 <midas> s3 had a lot of slowdowns, probably just going mental for a bit
12:22 <midas> getting some bigass originals from rawporter now
12:45 <arkiver> http://www.codespaces.com/
13:19 <Cameron_D> http://blog.snappytv.com/?p=2101
13:19 <Cameron_D> https://blog.twitter.com/2014/snappytv-is-joining-the-flock
13:21 <Cameron_D> doesn't seem like they really host anything
13:21 <Cameron_D> (Nor does it seem like they are closing up yet)
18:36 <schbirid> thanks for cleaning up https://archive.org/details/nycTaxiTripData2013 whoever did that :)
18:42 <DFJustin> you might want to put a better title/description on it
18:53 <schbirid> true
18:54 <SN4T14> I want to make a shitty "acid trip" joke. :p
19:00 <midas> about 1000 more items to go in the rawporter grab from S3
19:10 <swebb> Pixorial video-, photo-sharing service shutting down - Denver... Giving 30 days' notice. http://www.bizjournals.com/denver/news/2014/06/17/pixorial-video-photo-sharing-service-shutting-down.html?ana=twt
19:19 <swebb> "636,000 users, 2.7 million minutes of video and countless photos"
19:27 <ivan`> google doesn't seem to know much of their user content, so I'm guessing it's almost entirely private?
19:27 <ivan`> did spot some "www.pixorial.com/watch/" https://encrypted.google.com/search?q=site:pixorial.com&gbv=1&prmd=ivns&ei=EzmjU5SWA4bxoATxyIKYAg&start=20&sa=N
19:27 <db48x> signups are disabled
19:27 <db48x> no information about their api
19:27 <swebb> Yea, I'm guessing so. It seems like the company is only 6 months old or so.
19:28 <db48x> they don't actually display any of that content except through embedding
19:29 <db48x> wikipedia says founded in 2007 and launched in 2009
19:30 <db48x> hmm, on the other hand: http://myhub.pixorial.com/watch/636a65d9042aa8416876197a9c44e38b
19:31 <db48x> <source id="mp4Source" src="http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b" type="video/mp4"></source><source id="webmSource" src="http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b.webm" type="video/webm"></source>
19:51 <arkiver> so for pixorial
19:51 <arkiver> I think it's best to just scan all the links like http://pixori.al/26e01
19:51 <arkiver> with http://pixori.al/*****
19:51 <arkiver> * = 0-9 or a-f
19:52 <arkiver> http://pixori.al/26e01 goes to http://myhub.pixorial.com/watch/636a65d9042aa8416876197a9c44e38b
19:54 <arkiver> change http://myhub.pixorial.com/watch/636a65d9042aa8416876197a9c44e38b to http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b and you have the video
19:54 <arkiver> there are mp4 and webm versions
19:54 <arkiver> mp4: http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b.mp4
19:54 <arkiver> webm: http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b.webm
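The watch-to-video rewrite arkiver describes is mechanical, so it fits in a few lines. A minimal Python sketch; the function name is ours, and it assumes every /watch/ URL follows this exact layout:

```python
# Derive the direct media URLs from a Pixorial /watch/ page URL,
# per the pattern described above (assumed to hold for all items).
def media_urls(watch_url):
    base = watch_url.replace("/watch/", "/video/")
    return [base + ".mp4", base + ".webm"]

# media_urls("http://myhub.pixorial.com/watch/636a65d9042aa8416876197a9c44e38b")
# -> ['http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b.mp4',
#     'http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b.webm']
```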
19:54 <exmic> that's easy enough
19:55 <arkiver> so I don't think it's that hard to do...
19:55 <arkiver> just scanning all the http://pixori.al/***** links
19:55 <arkiver> and you have it :)
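Enumerating the keyspace arkiver proposes is straightforward with itertools; a sketch assuming the five-character codes really are limited to 0-9a-f (the output file name is illustrative):

```python
from itertools import product

HEX = "0123456789abcdef"

# 16**5 = 1048576 candidate short links, one per line
with open("pixorial-5char-urls.txt", "w") as out:
    for combo in product(HEX, repeat=5):
        out.write("http://pixori.al/" + "".join(combo) + "\n")
```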
19:55 <arkiver> let's make a channel
19:56 <arkiver> #pixi-death
19:56 <arkiver> ? :P
19:58 <arkiver> found s3 again :) just like with earbits
20:00 <db48x> hah
20:02 <arkiver> http://pix-wp-assets.s3.amazonaws.com/
20:14 <midas> arkiver: http://pastebin.com/TNR3JXZy
20:14 <midas> open s3 bucket, again
20:14 * db48x revives his aws account
20:15 <midas> running du on it now
20:15 <db48x> was about to ask :)
20:15 <midas> did you find a bucket for the images/movies too, arkiver?
20:15 <SN4T14> arkiver, hate to burst your bubble, but that's 60466176 unique addresses
20:16 <midas> 319591281 s3://pix-wp-assets/
20:16 <midas> SN4T14: so? that's not that much
20:16 <db48x> probably all just wordpress stuff like the name implies
20:17 <midas> yeah
20:17 <db48x> wouldn't surprise me if there were others though
20:17 <arkiver> SN4T14: http://pixori.al/***** is 1048576 addresses...
20:17 <arkiver> 0-9 and a-f
20:17 <schbirid> next archiveteam project will be "s3cmd ls s3://*"
20:17 <db48x> heh
20:18 <schbirid> there were some guys who indexed that years ago btw
20:18 <schbirid> based on keywords
20:18 <schbirid> err, *wordlists
20:18 <SN4T14> arkiver, that's 36 characters, 36^5 = 60466176
20:18 <midas> schbirid: you know, wouldn't surprise me if that was next :p
20:18 <midas> grabbing every s3 bucket that is public
20:18 <SN4T14> Oh, a-f
20:18 <SN4T14> Whoops
20:19 <arkiver> hehe
20:19 <SN4T14> Yeah, then it's definitely doable
20:19 <arkiver> running a crawl of the 1048576 links now
20:19 <arkiver> see what's coming
20:19 <arkiver> the videos are embedded in html so that's easy
20:22 <db48x> even easier; the urls are predictable so we don't even have to parse the html
20:24 <arkiver> tested some urls: the urls with 5 characters are indeed a-f: http://pixori.al/b532f http://pixori.al/3bcbb http://pixori.al/5a5b2 http://pixori.al/dcc94 http://pixori.al/e5cef http://pixori.al/cd75e
20:25 <arkiver> looks like there are no capital letters in any of them
20:25 <arkiver> but, I found this one too: http://myhub.pixorial.com/watch/045ad22847837e2cc150484f7421e950
20:25 <arkiver> it has http://pixori.al/7sAR
20:25 <arkiver> :/
20:26 <schbirid> redirects to the same url
20:26 <schbirid> as lowercase
20:26 <midas> nice
20:27 <arkiver> schbirid: great!
20:27 <arkiver> so http://pixori.al/7sAR is http://pixori.al/7sar
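Given that redirect behaviour, lowercasing codes before queueing avoids fetching the same target twice; a small sketch (it assumes the case-insensitivity schbirid observed holds for all codes, which was only spot-checked here):

```python
# http://pixori.al/7sAR -> http://pixori.al/7sar
def normalize_short_url(url):
    prefix, code = url.rsplit("/", 1)
    return prefix + "/" + code.lower()
```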
20:28 <arkiver> so for http://pixori.al/**** there are 1679616 urls
20:30 <arkiver> wow.
20:30 <arkiver> going fast; running the crawl on the 1048576 urls from http://pixori.al/***** and it's doing 3 urls per second
20:30 <arkiver> well, not extremely fast, but better than earbits
20:30 <SN4T14> A *whole* 3 urls per second. :p
20:31 <SN4T14> Why don't you start multiple instances of your script?
20:31 <arkiver> SN4T14: will do that
20:31 <arkiver> will start 5
20:32 <db48x> it's more than just 0-9a-f
20:32 <db48x> alas
20:32 <arkiver> db48x: do you have an example url?
20:32 <db48x> arkiver: oh, good, we don't have to deal with capitals as well :)
20:33 <arkiver> db48x: oh sorry, I see
20:33 <arkiver> so on http://pixori.al/**** it is 0-9a-z and on http://pixori.al/***** it is 0-9a-f
20:34 <midas> Rawporter data is downloaded
20:34 <arkiver> I tested around 15 urls from http://pixori.al/***** and they were all only up to f, but there could be more than f...
20:34 <arkiver> let me know if anyone finds such a case
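One way to look for codes past 'f' would be to probe a few candidates and see whether they redirect; a hypothetical check using the requests library (not something anyone in the channel actually ran):

```python
import requests

# If any of these return a redirect instead of a 404, the
# five-character space is wider than hexadecimal.
for code in ("g0000", "ggggg", "zzzzz"):
    r = requests.head("http://pixori.al/" + code, allow_redirects=False)
    print(code, r.status_code)
```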
20:34 <midas> f seems logical, being hex
20:35 <midas> Rawporter ended up being 75GB of data
20:35 <arkiver> :)
20:35 <arkiver> of images only?
20:36 <db48x> if the final digit only goes up to f then that's still 26.8 million
20:36 <midas> images and video
20:37 <arkiver> db48x: how do you get that number?
20:37 <arkiver> 16^5 is 1048576
20:38 <db48x> 36^4*16
20:39 <db48x> although it could be two separate namespaces, in which case it's 36^4 + 16^5
20:39 <arkiver> yes....
20:40 <arkiver> it's 36^4 + 16^5, so that's 1679616 + 1048576 = 2728192 urls
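The arithmetic checks out; for the record:

```python
four_char = 36 ** 4   # 0-9a-z on pixori.al/****  -> 1679616
five_char = 16 ** 5   # 0-9a-f on pixori.al/***** -> 1048576
assert four_char + five_char == 2728192
```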
20:40 <arkiver> but maybe http://pixori.al/***, http://pixori.al/** and http://pixori.al/* exist too..
20:44 <schbirid> http://pixori.al/* always redirects to http://myhub.pixorial.com/s/* which then redirects to the actual URL. you can save one redirect by using the myhub URL directly. saved almost 20% on my simple tiny test
20:47 <schbirid> ok, the second test just gave ~4% though :P
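schbirid's shortcut in code form: build the intermediate URL directly and skip the first hop (a sketch; it assumes the /s/ mapping holds for every code):

```python
# Skips the pixori.al hop:
# http://pixori.al/<code> -> http://myhub.pixorial.com/s/<code>
def direct_url(code):
    return "http://myhub.pixorial.com/s/" + code
```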
20:52 <arkiver> cool
20:52 <arkiver> Wayback is also playing the videos very well in the browser
20:52 <arkiver> http://web.archive.org/web/20140619195035/http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b
20:55 <arkiver> going to split the list of links from http://pixori.al/***** into 5 packs
20:55 <arkiver> no, 10 packs
20:55 <arkiver> then download them simultaneously
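Splitting the list into ten packs for parallel crawls can be done with round-robin slicing; a sketch with illustrative file names:

```python
N_PACKS = 10

with open("pixorial-5char-urls.txt") as f:
    urls = f.read().splitlines()

# Round-robin split so each pack gets an even share of the keyspace.
for i in range(N_PACKS):
    with open("pack-%02d.txt" % i, "w") as out:
        out.write("\n".join(urls[i::N_PACKS]) + "\n")
```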
20:55 <arkiver> hopefully, they don't have some kind of stupid banning system...
20:56 <arkiver> will set soTimeoutMs to 20000 milliseconds and timeoutSeconds to 120000 seconds
20:56 <arkiver> to stay safe
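For reference, soTimeoutMs and timeoutSeconds are properties of Heritrix 3's FetchHTTP module; an override with the values arkiver quotes would go in crawler-beans.cxml roughly like this (a sketch, not a tested configuration):

```xml
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
  <!-- socket timeout: give slow responses 20000 ms before giving up -->
  <property name="soTimeoutMs" value="20000"/>
  <!-- overall per-fetch timeout, in seconds -->
  <property name="timeoutSeconds" value="120000"/>
</bean>
```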
20:57 <arkiver> the videos don't download very fast (I got 50-100 kbps for one video)
20:57 <arkiver> so if a video is multiple GBs...
20:57 <Smiley> can you create url lists?
20:57 <Smiley> damnit, we need a warrior-type project we can just feed lists of urls, like archivebot does.
20:57 <arkiver> haha no need man
20:57 <arkiver> :P
20:57 <arkiver> we got 30 days
20:58 <arkiver> going to start 10 simultaneous crawls on http://pixori.al/***** in a bit
20:59 <Smiley> it'd still be an awesomely useful project
20:59 <arkiver> yep
21:00 <arkiver> Smiley: http://www.onetimebox.org/box/wi7sHp4t4Qrc26w58
21:00 <arkiver> those are the links
21:02 <db48x> I say we do it with the warrior
21:03 <db48x> it's a more sure way to go, and if we get it done faster, people are less likely to delete things
21:03 <arkiver> yeah
21:04 <arkiver> but I have zero knowledge about the warrior...
21:04 <arkiver> :/
21:04 <db48x> but we should start scanning the urls now, before that is set up
21:04 <Smiley> do we have some way of feeding the warrior loads of urls?
21:04 <Smiley> we really need a framework D:
21:04 <db48x> it's pretty easy
21:04 <arkiver> db48x: why scan the urls?
21:04 <Smiley> like a generic one...
21:04 <arkiver> why not do the short urls in the warrior?
21:04 <arkiver> and have people just download those?
21:04 <db48x> we could be finished scanning the urls in a day or two
21:05 <arkiver> yeah
21:05 <arkiver> but I have no idea how to scan urls...
21:05 <db48x> the pipeline for the warrior task may take that long to write and test
21:05 <arkiver> O.o
21:05 <arkiver> I know how to download and use heritrix and such things
21:05 <arkiver> but someone else would have to do the short url scanning
21:05 <db48x> ah
21:06 <arkiver> sorry
21:06 <arkiver> :(
21:06 <db48x> I thought you were just scanning
21:06 <arkiver> no no
21:06 <arkiver> I was already downloading
21:06 <arkiver> 3 urls per second
21:06 <arkiver> would be done in around 5 to 10 days
21:06 <arkiver> with all the urls
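A back-of-the-envelope check on that estimate:

```python
total = 36 ** 4 + 16 ** 5    # 2728192 short urls across both namespaces
rate = 3.0                   # urls per second, as observed
days = total / rate / 86400
print(days)                  # ~10.5 days single-threaded, ~1 day with 10 crawls
```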
21:06 <arkiver> But it's more fun with a warrior project :)
21:06 <db48x> assuming it's only a few million
21:06 <arkiver> #pixi-death
21:07 <arkiver> couldn't think of anything better... :P
21:09 <db48x> #pixofail :D
21:10 <arkiver> :D muuuuch better
21:15 <arkiver> :P
21:34 <db48x> yipdw: ping?
21:35 <yipdw> db48x: hi
21:36 <db48x> yipdw: howdy. how do you usually create a new job for the warriors? fork seesaw-kit?
21:36 <yipdw> yeah
21:38 <yipdw> I usually also subtree merge the readme repo in, but copy is fine too
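For reference, the workflow yipdw describes (clone/fork seesaw-kit, then subtree-merge a readme repo in) looks roughly like this; the remote name and readme repo URL are placeholders:

```
git clone https://github.com/ArchiveTeam/seesaw-kit myproject-grab
cd myproject-grab
git remote add readme <readme-repo-url>   # placeholder; use the actual readme repo
git fetch readme
git merge -s subtree readme/master
```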
21:38 <db48x> I guess you don't actually use a github fork
22:27 <yipdw> you can
22:49 <deathy> wondering if someone has studied CKAN archival/backup... it's a platform for "open data". Not that governments ever collapse and delete some of their published data... but best be safe...
22:50 <Nemo_bis> someone claims it's useful to upload data there in order to not have all eggs in IA's basket https://meta.wikimedia.org/wiki/Talk:Requests_for_comment/How_to_deal_with_open_datasets
22:52 <Nemo_bis> (I simplified)
23:05 <DFJustin> the more baskets the merrier
23:06 <deathy> heh.. "lots of copies keeps stuff safe"
23:07 <SN4T14> Yeah, that's why I always clone myself in three places.
23:07 <db48x> I want to back myself up outside my current light-cone
23:07 <SN4T14> Dude, set up a RAIC array. :p