Time | Nickname | Message
00:15 | deathy | any other very near future projects we know about but haven't been started yet? or have things calmed down for now?
02:53 | dashcloud | so, how do you archive twitter accounts? if someone knows, Dan Kaminsky mentioned that https://twitter.com/conorpotpie died recently
03:05 | deathy | don't know anything first hand, but there are some tools mentioned on the wiki: http://www.archiveteam.org/index.php?title=Twitter
03:06 | deathy | out of 3 tools, one is dead/broken link and another requires your login/pass (obviously not possible here..)
03:10 | deathy | and the 3rd one is also useless. URL params and classes used in page not the same anymore.
03:15 | deathy | updated wiki, made it clear 3rd tool is not usable anymore
03:16 | deathy | perhaps it would be good to add one there that actually works..
03:31 | xmc | deathy, dashcloud: archive.is seems to expand the whole page of a twitter account
03:33 | deathy | nope
03:33 | deathy | not really
03:34 | deathy | tried on that account dashcloud mentioned. It was doing at least a few URL requests which I recognized from monitoring the twitter infinite scroll thing
03:34 | deathy | but it maybe got 1-3 additional pages/scrolls
03:34 | deathy | from more than 50 I think on that user at least
03:36 | deathy | (used a lot of PgDn keys while looking at it.. )
03:36 | xmc | ah
03:38 | deathy | from archive.is: "There is 5 minutes timeout, if page is not fully loaded in 5 minutes, the saving considered failed. It is not often, but it happens."
03:38 | deathy | and maybe for extreme cases this could be an issue (if lots of twitter pics): "The stored page with all images must be smaller than 50Mb"
03:42 | deathy | and double/triple-confirmed, from archive.is blog, issues with twitter: http://blog.archive.is/post/51400352393/it-seems-that-twitter-feeds-with-a-lot-of-tweets-500
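For context on why those saves cut off: the profile page only ships the newest screenful of tweets in its HTML, and every further "scroll" is a separate background request returning an older chunk plus a cursor for the next one. The requests deathy describes monitoring could in principle be replayed in a loop until the server reports nothing older, roughly as sketched below. The /i/profiles/show/.../timeline path and the max_position / has_more_items names are reconstructions of what the 2013-era page appeared to request, not a documented API, so treat this as an illustration of the pagination idea rather than a working grabber.

    #!/bin/bash
    # Sketch: replay a profile's infinite-scroll requests until exhausted.
    # Endpoint path and the max_position / has_more_items field names are
    # assumptions about the 2013-era page, not a documented or stable API.
    user="conorpotpie"
    cursor=""                     # empty = start from the newest tweets
    page=0
    while :; do
        url="https://twitter.com/i/profiles/show/${user}/timeline?include_entities=1"
        [ -n "$cursor" ] && url="${url}&max_position=${cursor}"
        out=$(printf 'scroll-%04d.json' "$page")
        curl -s "$url" -o "$out"
        # stop when the response no longer advertises older items
        grep -q '"has_more_items":true' "$out" || break
        cursor=$(grep -o '"max_position":"[0-9]*"' "$out" | grep -o '[0-9]\+' | head -n 1)
        [ -z "$cursor" ] && break # no cursor found: bail out rather than loop forever
        page=$((page + 1))
        sleep 2                   # one account, fetched politely
    done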
04:13 | xmc | mm
07:02 | SketchCow | Where's the hug.
07:05 | * | BlueMaxim hugs SketchCow
08:35 | * | Nemo_bis read "bug"
09:07 | arkiver | why are wikipedia pages saved so badly in the wayback machine?
09:07 | arkiver | http://web.archive.org/save/http://nl.wikipedia.org/wiki/Hoofdpagina
10:19 | jonas_ | hi=) what about getting more yahoo blogs from google cache (and gigablast cache) (or a cdn available for this if any)?
10:45 | m1das | jonas_: join #shipwretched and update your y!b code for new grabs
11:31 | arkiver | looks like the archive.is saves are made like ####
11:31 | arkiver | would make it possible to save
11:31 | arkiver | 14,776,336 combinations
11:31 | arkiver | then run a program on them to keep only the existing ones and discover all the other urls
11:31 | arkiver | and then download
11:44 | arkiver | looks like they also have ##### (with 5) now...
11:45 | arkiver | the #### is full, all of them are used, so that would be good to archive
11:45 | arkiver | will try to start a grab on that... :)
12:09 | arkiver | generated all urls from aaaa-0000
12:10 | arkiver | now starting the url discovery of the first batch: aaaa-a000
12:10 | arkiver | or nah, gonna do aaaa-d000
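arkiver's 14,776,336 is 62^4, i.e. the figure you get if the four-character codes are case-sensitive letters plus digits. Assuming that alphabet, that short links look like https://archive.is/<code>, and that unused codes return a non-200 status (all guesses about archive.is, not confirmed behaviour), the discovery pass arkiver describes is just a very large existence check:

    #!/bin/bash
    # Sketch of the discovery pass: try every 4-character code and record the
    # ones that resolve. The [a-zA-Z0-9] alphabet, the https://archive.is/<code>
    # URL shape and "unused code => non-200" are assumptions, not confirmed.
    alphabet=( {a..z} {A..Z} {0..9} )

    check() {                     # print the code if its short URL exists
        local code="$1"
        status=$(curl -s -o /dev/null -w '%{http_code}' "https://archive.is/${code}")
        [ "$status" = "200" ] && echo "$code"
        sleep 1                   # 14,776,336 candidates: this has to be throttled anyway
    }

    for a in "${alphabet[@]}"; do
      for b in "${alphabet[@]}"; do
        for c in "${alphabet[@]}"; do
          for d in "${alphabet[@]}"; do
            check "${a}${b}${c}${d}" >> existing-codes.txt
          done
        done
      done
    done

At one request per second the full space takes roughly 170 days, which is presumably why arkiver splits it into batches like aaaa-a000 and aaaa-d000 and runs them separately.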
12:37 | antomatic | jonas_: Apparently google cache is really hard to archive from, they rate-limit so aggressively that it's almost impossible to do on any scale
12:41 | BiggieJo1 | pretty much need a massive block of IPs and randomly scatter requests across the block
12:56 | Nemo_bis | so far I'm not having problems with concurrency 4
12:57 | joepie91 | there's an old piece of software in Perl that does Google Cache extraction pretty well afaik
13:06 | ivan` | I waited 2 minutes between requests to google cache and it worked fine
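Two data points in this thread: concurrency 4 is working for Nemo_bis, and one request every two minutes worked for ivan`. A minimal sketch of the slow single-client variant is below, using the public cache-lookup URL form (webcache.googleusercontent.com/search?q=cache:<url>); the two-minute figure is just what ivan` found tolerable, not a documented limit.

    #!/bin/bash
    # Sketch: fetch Google's cached copy of each URL in urls.txt, one request
    # every two minutes (an observed-safe rate, not a documented one).
    mkdir -p cache
    while read -r target; do
        name=$(printf '%s' "$target" | tr -c 'A-Za-z0-9._-' '_')
        wget -q -O "cache/${name}.html" \
             "http://webcache.googleusercontent.com/search?q=cache:${target}"
        sleep 120
    done < urls.txt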
13:06 | antomatic | is that wretch or blogs or both, Nemo_bis ?
13:06 | antomatic | sounds encouraging
13:07 | Nemo_bis | blogs
13:07 | antomatic | cool
13:07 | Nemo_bis | but I see I uploaded only 14 items so far, dunno what's going on for real
13:07 | arkiver | are you talking about getting the yahoo things from google cache?
13:15 | arkiver | archive.is is blocked from the IA for some reason...
13:15 | arkiver | :(
15:46 | Cowering | anyone else blocked from dropbox for too much BW, even when you know it is not true?
16:47 | balrog | Cowering: like blocked completely, or one file blocked?
17:32 | arkiver | etsi.org/deliver/ save almost complete
17:32 | arkiver | working on wikileaks website
18:21 | chavezery | don't know if anyone would want to keep this, but it's here in case: https://bui.pm/ded
18:21 | chavezery | it was a /b/ archive, dead now, images and database stuff is up for grabs
18:23 | chavezery | and with that, it's time for me to leave
18:23 | chavezery | later all o/
18:40 | joepie91 | being archived
19:51 | alexvoda | hello
19:51 | alexvoda | WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
19:52 | BiggieJo1 | looks like the bot is not responding - secret word is yahoosucks
19:52 | alexvoda | thank you
19:53 | alexvoda | also I might as well ask this over here too
19:53 | alexvoda | I want to help archive myopera
19:53 | alexvoda | how can I help?
19:54 | BiggieJo1 | info page is here - http://archiveteam.org/index.php?title=My_Opera
19:54 | alexvoda | yes I know
19:54 | alexvoda | but what can I do
19:55 | alexvoda | It doesn't seem set up for use with the warrior
19:55 | BiggieJo1 | looks like it's not a warrior project, would need to check with Mithrandir who is running that project
19:55 | alexvoda | oh
19:56 | joepie91 | BiggieJo1: wait, we have a bot responding with yahoosucks?
19:57 | alexvoda | I see, hmm there seems to be no contact info on his wiki page
19:57 | joepie91 | irony, a bot responding to a question for the anti-bot system...
19:57 | alexvoda | just a public key
19:58 | alexvoda | does anyone know if he is regularly on IRC?
19:58 | nico_32 | joepie91: where ?
19:58 | joepie91 | nico_32: see what BiggieJo1 said
19:59 | BiggieJo1 | arrgh, nick broken again
19:59 | joepie91 | not broken, just in an alternative state of functioning
19:59 | joepie91 | :)
20:00 | alexvoda | or should I write in the discussion page for myopera or for him?
20:02 | nico_32 | try to write on his discussion page
20:02 | nico_32 | it "should" make mediawiki show him/her/hir a message at next login
20:03 | alexvoda | ok, thanks for the advice
20:13 | DeVan | Just found a webhost with an insane amount of datasheets
20:13 | nico_32 | DeVan: url ?
20:14 | DeVan | I don't know
20:15 | nico_32 | ...
20:15 | DeVan | afraid the swarm will make him close it
20:16 | nico_32 | send it privately and i will make a very slow download
20:17 | nico_32 | with --delay 20s
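nico_32's --delay 20s is a flag of whatever grabber they have in mind; with plain wget the same go-slow mirror looks roughly like this (example.com stands in for the datasheet host, which was deliberately not named in channel):

    # Low-impact mirror: one request at a time, ~20 s between requests, bandwidth
    # capped, everything recorded to a WARC. example.com is a placeholder for the
    # datasheet host, which was only shared privately.
    wget --mirror --no-parent --page-requisites --adjust-extension \
         --wait=20 --random-wait --limit-rate=200k \
         --warc-file=datasheets \
         "http://example.com/datasheets/"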
20:17 | balrog | electronicsandbooks?
20:17 | DeVan | yeah
20:17 | balrog | people have archived that before; it's very very slow
20:20 | DeVan | balrog: not that site
20:22 | nico_32 | last IA crawling: 2011/2012
21:37 | arkhive | would be good maybe to backup xbins (xbox-scene.com file downloads)
21:39 | arkhive | http://www.xbins.org/ and the actual files obtained via IRC and FTP
21:39 | arkhive | http://www.xbox-scene.com/articles/xbins.php
21:39 | arkhive | tutorial.
21:41 | arkhive | I downloaded a whole bunch about ten years ago but all the files probably have newer releases/versions and i never got the whole lot.
21:42 | arkhive | Also, I think there was a limit on how many you could get a day. or hour. Can't remember though.
21:42 | arkiver | arkhive: I will run an url discovery program tomorrow on those and see how big the sites are and how many files they contain
21:43 | arkiver | will then start a grab
21:43 | arkhive | okay. i think they are hosted off site. like not on xbins.org. can't remember though. I was like 13 lol
21:43 | DFJustin | joepie91: oh are you grabbing that /b/ thing, I just emailed jason about it
21:43 | arkhive | but i'll get those CD's i put the xbins files on.
21:44 | joepie91 | DFJustin: yes, one of my boxes is downloading it atm
21:46 | DFJustin | \o/
21:46 | arkhive | I still need to continue my saving of old apps/programs from dead/zombied mobile platforms. good article if i remember right lol. http://www.visionmobile.com/blog/2012/01/the-dead-platform-graveyard-lessons-learned-2/
21:46 | arkhive | but to -bs for me. :)
21:56 | joepie91 | for the record, I have several servers with 500G disk now
21:56 | joepie91 | so anything up to that size, I can fetch
21:56 | joepie91 | (ping me when necessary)
21:59 | yipdw | joepie91: !a https://bui.pm/ded
21:59 | joepie91 | yipdw: it's too big for that
22:00 | yipdw | I should add that to ArchiveBot
22:00 | yipdw | "Sorry, this item is too big"
22:04 | bsmith093 | http://bofh.nikhef.nl/events/ this seems to be a mirror for... everything con related for tech things. how can i get a size, without just dl-ing all of it?
22:06 | arkiver | arkhive: I will do the xbox-scene website first, since those downloads are on-site
22:06 | arkiver | and then I'll take a look at the other one
22:06 | arkiver | :)
22:11 | joepie91 | bsmith093: not, probably
22:11 | joepie91 | unless you script a bit
22:13 | bsmith093 | joepie91: I'm gonna run a wget spider, then dump the log to a url extractor, then dump that into jdownloader, so i at least know how big it is.
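The jdownloader step in that workflow is only there to learn the total size, and that part can be done from the URL list directly: issue a HEAD request per URL and add up the Content-Length headers. A rough sketch, assuming urls.txt is the list extracted from the spider log; responses without a Content-Length contribute nothing, so the result is a lower bound.

    #!/bin/bash
    # Estimate total size of a URL list by summing Content-Length from HEAD
    # requests. urls.txt = list pulled out of the wget --spider log; responses
    # lacking a Content-Length header are simply not counted.
    total=0
    while read -r url; do
        len=$(curl -sIL "$url" | tr -d '\r' |
              awk 'tolower($1) == "content-length:" { n = $2 } END { print n + 0 }')
        total=$((total + len))
    done < urls.txt
    printf 'approx. %d bytes (%d MiB)\n' "$total" $((total / 1024 / 1024))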
22:13 | joepie91 | oh man :P
22:15 | arkiver | bsmith093: I will try to get the size and the number of urls for you tomorrow
22:16 | bsmith093 | ark i'm already running the spider thats faster, isnt it?
22:16 | yipdw | mentioning jdownloader to joepie91 is like talking to a TEA Party member about taxes
22:16 | yipdw | the cool thing about US-centric similes is that they're always worse than you intend
22:17 | yipdw | :P
22:17 | bsmith093 | joepie91: whats wrong with this workflow? seriously I'd love any suggestions :)
22:17 | joepie91 | bsmith093: it just seems a bit... duct-tapey :)
22:17 | joepie91 | aside from the jdownloader bit
22:19 | arkiver | hmm, we can see how accurate jdownloader is, please let me know the size of the site tomorrow
22:20 | * | m1das moves to #archiveteam-bs and opens the popcorn
22:23 | ivan` | bsmith093: I will tell you in a few minutes
22:24 | ivan` | I am using HTTrack and find . -name 'index.html' | xargs cat | grep 'alt="\[ \]">' | perl -p -i -e 's/ +/ /g' | python -c "exec 'import sys\nfor line in sys.stdin: print line.strip().split()[-1]'" | sed 's/G/*1024*1024*1024/g' | sed 's/M/*1024*1024/g' | sed 's/K/*1024/g'
22:25 | ivan` | buggy, ask me for fixed version later if you want it
22:31 | Nemo_bis | regex ftw
22:40 | ivan` | 400GB so far but it's still grabbing indexes
22:40 | ivan` | find . -name 'index.html' | xargs cat | grep 'alt="\[...\]">' | grep -v 'alt="\[DIR\]">' | perl -p -i -e 's/ +/ /g' | python -c "exec 'import sys\nfor line in sys.stdin: print line.strip().split()[-1]'" | sed 's/G/*1024*1024*1024/g' | sed 's/M/*1024*1024/g' | sed 's/K/*1024/g' | python -c "exec 'import sys\nprint sum(int(eval(line.strip(), {\'__builtins__\': None})) for line in sys.stdin)'"
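For anyone reading along, ivan`'s one-liner works by scraping the size column out of the Apache directory-index pages HTTrack has saved, expanding the K/M/G suffixes, and summing. The same idea spelled out a bit more readably is below; it still assumes the stock Apache <pre>-style fancy index, and it is only as precise as the rounded sizes shown in the listings.

    #!/bin/bash
    # Readable restatement of the pipeline above: file rows in an Apache fancy
    # index carry an icon with alt="[..]" (directories show alt="[DIR]"); the
    # last column of such rows is the human-readable size, which we expand and
    # total. Accuracy is limited by the rounded sizes Apache prints.
    find . -name 'index.html' -print0 | xargs -0 cat |
      grep 'alt="\[...\]">' | grep -v 'alt="\[DIR\]">' |
      awk '{
        size = $NF
        n = size + 0
        if      (size ~ /K$/) n *= 1024
        else if (size ~ /M$/) n *= 1024 * 1024
        else if (size ~ /G$/) n *= 1024 * 1024 * 1024
        sum += n
      }
      END { printf "%.1f GiB\n", sum / 1024 / 1024 / 1024 }'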
23:57 | bsmith093 | well ok then, ivan` you can grab it if you eant, holy crap thats big
23:57 | bsmith093 | *want