Time | Nickname | Message
09:48 | jonbro_ | heya!
09:48 | jonbro_ | could anyone help me with a last minute archive project?
09:49 | jonbro_ | a local pittsburgh message board is going down, and I am trying to wget it all before it disappears.
09:50 | Schbirid | step a: link the linky link
09:50 | jonbro_ | nevertellmetheodds.org
09:50 | jonbro_ | I don't have a warrior project, just looping over each thread
09:51 | jonbro_ | supposedly he was going to take it down today, so I am not sure what the countdown clock actually looks like.
09:52 | Schbirid | if the threads are enough, for i in {1..135000}; do wget -nv http://nevertellmetheodds.org/t.php?id=$i; done :D
09:53 | jonbro_ | cool.
09:53 | jonbro_ | thanks much!
09:54 | Schbirid | that will not include the inline images though
09:54 | jonbro_ | ah I am not worried about that, it is all offsite links.
09:54 | jonbro_ | those were always impermanent
09:57 | Schbirid | for i in {102960..135000}; do wget -e robots=off -nv --page-requisites --span-hosts --reject-regex="(/favicon.ico|/style.css)" --exclude-domains=ajax.googleapis.com,www.businesscasualarchnemesis.com --convert-links --adjust-extension http://nevertellmetheodds.org/t.php?id=$i; done
09:57 | Schbirid | small test, should grab inline images
09:58 | godane | so i'm grabbing sitemaps of funnyordie.com
09:58 | godane | there are like 230 of them
09:59 | godane | from there i can grab the video pages
09:59 | godane | but i really don't have to
09:59 | jonbro_ | hmmmm...
10:00 | godane | the videos are hosted like this: http://vo.fod4.com/v/befceb53c6/v600.mp4
10:00 | Schbirid | i would recommend logging, add a "-a logfile_date.log". wget will print all messages there then
10:00 | godane | the url is this: http://www.funnyordie.com/videos/befceb53c6/dani-weirdo-music-for-weirdos
10:01 | jonbro_ | ok, will do :D got this running on two computers and an ec2 instance now :D
10:02 | Schbirid | different number ranges, right? :)
10:02 | jonbro_ | yep
10:02 | jonbro_ | hopefully it will finish before the owner wakes up and pulls the plug
10:03 | godane | you need just the id part of the url to get the video paths
10:03 | Schbirid | :D
10:03 | godane | also change v600.mp4 to v1200.mp4 so you get the high bit rate version
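godane's two example URLs share the id segment (befceb53c6), so the video page URL apparently maps straight onto the CDN path. A sketch of that mapping, assuming the pattern holds for other videos (the helper name is ours, not godane's):

```python
def video_url(page_url, quality="v1200"):
    # page URLs look like .../videos/<id>/<slug>; the <id> segment
    # names the file on the vo.fod4.com CDN. "v1200" picks the
    # higher-bitrate encode godane mentions, "v600" the lower one.
    parts = page_url.rstrip("/").split("/")
    video_id = parts[parts.index("videos") + 1]
    return "http://vo.fod4.com/v/%s/%s.mp4" % (video_id, quality)
```

For the page URL above this yields http://vo.fod4.com/v/befceb53c6/v1200.mp4.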
10:05 | Schbirid | i started from 135k and 100k going down
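Splitting the id space by hand (one worker from 135k down, one from 100k down) works, but hand-picked ranges can overlap and fetch the same threads twice. A sketch of computing non-overlapping ranges for any number of workers (the function and the worker count are illustrative, not what was actually run):

```python
def split_ranges(lo, hi, workers):
    # Divide the inclusive id range lo..hi into `workers` contiguous,
    # non-overlapping chunks, so each machine fetches distinct threads.
    span = hi - lo + 1
    chunk = -(-span // workers)  # ceiling division
    ranges = []
    for w in range(workers):
        start = lo + w * chunk
        if start > hi:
            break
        ranges.append((start, min(start + chunk - 1, hi)))
    return ranges
```

Each (start, end) pair then becomes one `for i in {start..end}; do wget ...; done` loop on its own machine.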
10:23 | ivan` | jonbro_: if you use wget --warc-file your archive could be put into wayback
10:24 | ivan` | if there is no WARC, it's usually a lost cause
10:25 | jonbro_ | oh really :(
10:25 | ivan` | and starting one wget per URL is super dumb, especially when you're grabbing page requisites
10:25 | jonbro_ | can I warc it after the fact?
10:25 | ivan` | but also because it doesn't reuse connections
10:25 | ivan` | no
10:25 | jonbro_ | oh rly????
10:25 | jonbro_ | oh shit.
10:25 | jonbro_ | I mean, there is only one page requisite, just a style.css
10:26 | ivan` | oh that's not so bad then
10:26 | jonbro_ | ok, cool, thank god :D
10:26 | Schbirid | ivan`: would you rather make an input file with urls? thought about that too
10:27 | ivan` | that, or a terrible hack like this is what I'm using for one site
10:27 | ivan` | for i in range(0, 107):
10:27 | ivan` | start = i*10000 + 1
10:27 | ivan` | end = (i+1)*10000
10:27 | ivan` | print start, end
10:27 | ivan` | os.system(r'wget --output-document=tmp --warc-file=bugzilla.redhat.com-%s-%s --warc-cdx -e robots=off https://bugzilla.redhat.com/show_bug.cgi\?id={%s..%s}' % (str(start).zfill(8), str(end).zfill(8), start, end))
10:28 | jonbro_ | ah interesting.
10:28 | jonbro_ | is that faster?
10:28 | ivan` | yes
10:28 | Schbirid | oh wow yes
10:29 | jonbro_ | ok, once one of these chunks ends I will switch over to that.
10:29 | Schbirid | for i in {0..135000}; do echo "http://nevertellmetheodds.org/t.php?id=$i" >> urls; done
10:29 | Schbirid | wget -nv --convert-links --adjust-extension --warc-file=nevertellmetheodds.org_20140301 --warc-cdx -i urls
10:29 | Schbirid | several times as fast
10:29 | * | Schbirid hides in a corner
10:29 | ivan` | yes
10:30 | Cameron_D | Just stick & after the wget call. Parallelise things :-)
10:30 | ivan` | heh
10:30 | Schbirid | "how to annoy a webserver" :P
10:30 | ivan` | how to get a precious IPv4 IP banned for life
10:30 | ersi | :D
10:31 | ersi | is this #superhackysaturday?
10:31 | Cameron_D | Pretty sure your system would fall over first, trying to start 135k processes at once :P
10:31 | Smiley | nah, it'll be fine
10:31 | Smiley | you'll OOM pretty quickly tho
10:31 | Smiley | the voice of experience ^^^
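As the OOM warnings above suggest, a bare `&` forks one process per URL. A bounded worker pool keeps only a few fetches in flight at a time; `fetch` here is a placeholder for shelling out to wget or using an HTTP library:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # placeholder: a real crawler would download `url` here
    return url

def crawl(urls, workers=4):
    # at most `workers` downloads run concurrently, instead of
    # 135k backgrounded wget processes competing for memory
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

pool.map preserves input order, so results line up with the url list.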
10:32 | unbeholde | Schbirid: I have a small problem
10:32 | Schbirid | another unsatisfied customer :((
10:33 | unbeholde | just found that stealth project alpha is corrupt; I tried to download it but it failed shortly before finishing. Same goes for Wonderful Life
10:33 | Schbirid | give me the urls and i will re-grab them, might have been my fault
10:34 | jonbro_ | can I make multiple warc files? or is that a no-no if I want to chuck it on archive.org
10:34 | Cameron_D | you can make multiple and then merge them, I believe
10:35 | garyrh | jonbro_, https://github.com/alard/megawarc
10:36 | jonbro_ | cool!
10:38 | unbeholde | what was that text upload service called?
10:39 | Schbirid | pastee.org is nice
10:39 | unbeholde | hmm, ah, here, ok, I got it up on pastebin http://pastebin.ca/2648419
10:40 | Schbirid | ok
11:08 | Schbirid | unbeholde: https://www.quaddicted.com/files/temp/fp/
11:08 | Schbirid | 7z says they are intact
11:33 | unbeholde | ah thank you, Wonderful Life is working, just waiting for the stealth project to finish! By the way, I'm having trouble uploading Ons2.0.zip and ut3domfinal_winsetup.zip at the moment, it failed on me twice. Do you happen to have a direct link for them?
11:36 | * | unbeholde slaps Schbirid around a bit with a large fishbot
11:36 | Schbirid | yay
11:36 | Schbirid | uploading?
11:37 | unbeholde | mmm, to the almighty modDB
11:37 | Schbirid | nothing more direct than the tarview links
11:40 | Nemo_bis | Meh, why does a search for "rare chunks" only bring up academic papers? BitTorrent can't be more stupid than eMule, can it? I hope peers send out rare chunks first. https://encrypted.google.com/search?q=bittorrent+"rare+chunks"
11:41 | unbeholde | it says tarview.php isn't a recognised file format
11:42 | Schbirid | Nemo_bis: there is super seeding, but apart from that iirc clients request what they want
11:47 | Nemo_bis | Really? But clients don't even know the other leechers, they can only ask for the wrong chunks...
11:49 | Schbirid | https://wiki.theory.org/BitTorrentSpecification#Piece_downloading_strategy
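The spec section linked here covers what Nemo_bis was asking about: peers do advertise what they hold, via bitfield and HAVE messages, so a client can rank the pieces it still needs by how many peers hold them and request the least-replicated one first ("rarest first"). A sketch of that selection rule:

```python
from collections import Counter

def rarest_first(needed, peer_bitfields):
    # needed: set of piece indices we still lack
    # peer_bitfields: one set of piece indices per connected peer,
    # learned from their bitfield/HAVE messages
    counts = Counter()
    for pieces in peer_bitfields:
        for p in pieces & needed:
            counts[p] += 1
    available = [p for p in needed if counts[p]]
    if not available:
        return None  # no peer we know has a piece we need
    # fewest holders first; tie-break on index for determinism
    return min(available, key=lambda p: (counts[p], p))
```

Real clients add refinements on top (random first piece, endgame mode), per the linked spec.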
11:52 | Nemo_bis | Hm. https://trac.transmissionbt.com/ticket/3767 is marked fixed, but I'm rather sure the crystal ball feature has not been implemented yet.
11:55 | Nemo_bis | Dunno how much http://www.cs.sfu.ca/~mhefeeda/Papers/ism09.pdf applies
16:30 | Asparagir | Need ops in ArchiveBot plz. Want to feed in Crimean sites, including media, while there's still time.
23:30 | gui77 | given that myopera, bebo and canvas are already at the max rate and push is (practically) finished, is there any other project needing bandwidth? :)
23:55 | sanqui | hello. I know you folks are not archive.org, but I've been wondering if anybody can tell me if I'm fine uploading something there
23:56 | sanqui | or if there's a better place for both short-term availability and long-term preservation
23:57 | sanqui | basically, I don't know if you've heard about twitch plays pokémon, but it's been quite a phenomenon: https://en.wikipedia.org/wiki/Twitch_Plays_Pok%C3%A9mon
23:57 | sanqui | I've got nearly 3GB of logs, and many people have asked me to provide them for academic study
23:58 | sanqui | it's just under 400MB compressed, but I still don't want to host it myself