Time | Nickname | Message
00:04 | Coderjoe | I was continuing from what underscor said (the last person to say anything before me), and he was continuing from what dashcloud said (the last before underscor to say something)
01:46 | DFJustin | they're not uploading the full res to IA anyway
02:53 | Coderjoe | DFJustin: of course not. why upload for free what you can put on DVD/bluray and make money from?
02:54 | Coderjoe | (there has been an avgeeks collection at IA for years, with very few uploads added to it. Instead, Skip preferred selling his DVDs on his website)
03:00 | godane | Coderjoe: http://archive.org/details/avgeeks
03:00 | godane | they're still at it
03:01 | godane | the latest one was uploaded 3 days ago
03:01 | Coderjoe | you think I don't know that?
03:01 | Coderjoe | before this 100 miles project, uploads to that collection were very few and far between
03:05 | Coderjoe | hmm
03:05 | Coderjoe | I apparently missed a burst or two. I didn't realize there are currently 794 items in that collection
03:06 | godane | yes but the last 20 to 30 are only from 2012
03:09 | Coderjoe | earlier this week, or perhaps the end of last week, I was comparing the postings on the tumblr with the uploads on IA. there weren't a whole lot of recent non-prelinger uploads by skip
03:09 | godane | also i'm still downloading kat.ph/community
03:09 | godane | ok
03:10 | Coderjoe | hmm
03:10 | Coderjoe | I had not known about a large number he uploaded in 2011
03:21 | Coderjoe | and this should probably be in -bs
03:22 | ivan` | "NOTICE: Due to continous abuse, privatepaste.com will be shutting down August 1st, 2012."
03:22
🔗
|
ivan` |
too bad Google does not know about any pastes on it, I guess this is the "private" part |
03:23
🔗
|
ivan` |
https://encrypted.google.com/search?hl=en&source=hp&q="https+privatepaste.com" |
03:24
🔗
|
Coderjoe |
the only pastes that would show up on google are ones that were on some other website that they crawled |
03:28
🔗
|
nitro2k01 |
So, what now? Automatic crawling of other pastebins to grab as much of privatepaste as possible? |
03:30
🔗
|
shaqfu |
Is it possible to grab privatepastes via attrition? |
03:30
🔗
|
shaqfu |
Find how it generates URLs, then step through each and toss out dead links (probably with --redirects=0) |
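A minimal sketch of the attrition idea shaqfu describes, assuming the ten-character hexadecimal IDs seen in the example URL further down and a hypothetical output file; as the rest of the discussion makes clear, the size of the ID space makes this impractical:

    # check a handful of candidate IDs and keep the ones that don't 404
    # (purely illustrative -- the full space is roughly 1.1 trillion IDs)
    for id in 0000000000 0000000001 0000000002; do
        code=$(curl -s -o /dev/null -w '%{http_code}' "http://privatepaste.com/$id")
        [ "$code" = "200" ] && echo "$id" >> live-ids.txt
    done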
03:30 | ivan` | I'm grepping all of my IRC logs for privatepaste
03:31 | ivan` | they have subdomains like http://pgsql.privatepaste.com/a9940ba8de
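A grep along these lines would pull the paste URLs, subdomains included, out of a directory of plain-text logs; the log path, the output file, and the assumption that IDs are alphanumeric are all hypothetical:

    grep -rhoE 'https?://([a-z0-9-]+\.)?privatepaste\.com/[A-Za-z0-9]+' ~/irclogs/ \
        | sort -u > privatepaste-urls.txt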
03:31 | shaqfu | 16^10...oof
03:31 | nitro2k01 | The URLs are likely either random or dependent on the text
03:32 | shaqfu | nitro2k01: That's why I was curious if attrition would work
03:33 | shaqfu | But the space is too great
03:50 | Coderjoe | mmm
03:50 | Coderjoe | 1 trillion IDs
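The figure checks out if the IDs are ten hexadecimal characters, as in the a9940ba8de example above:

    $ echo $((16**10))
    1099511627776

That is about 1.1 trillion candidate URLs, which is why the brute-force approach goes nowhere.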
03:51 | shaqfu | 1 trillion HTTP requests
03:51 | shaqfu | Wonder if we'd melt their network cards
03:55 | nitro2k01 | Probably more like melting their patience
03:55 | nitro2k01 | "That's it, we're closing early"
03:55 | nitro2k01 | Also, knowing all valid IDs would do no good without the passwords for some of them
03:56 | Coderjoe | august 1, 2012 is in the past
03:56 | nitro2k01 | So be it
10:30 | emijrp | i have an issue with the zip explorer
10:30 | emijrp | http://ia600503.us.archive.org/zipview.php?zip=/21/items/Spanishrevolution-UnAnyoDeTrabajoEnLasPlazasYBarrios.ActasDel15m/Actas15M-0001-0500.zip
10:30 | emijrp | all the files are downloaded as 0 bytes
10:30 | emijrp | and do not open
10:34 | Nemo_bis | emijrp: how big is the file?
10:34 | Nemo_bis | sigh, I suppose this is he.net's fault again? http://p.defau.lt/?EhO6n_E45JN4GzsGwOCzAQ
10:35 | Nemo_bis | ^ this will make a single wiki take a couple of days to upload, emijrp
10:35 | emijrp | ok
10:35 | emijrp | :P
10:39 | emijrp | the zip is just 50 MB
10:40 | Nemo_bis | hmm
10:41 | Nemo_bis | filenames seem plain enough
13:21 | ersi | Linked an article about "How to crawl a quarter billion webpages in 40 hours" in #archiveteam-bs
16:13 | brayden | Interesting story from a friend. He set up some terrible, terrible... awful disgusting pile of work of a website on some free web hosting thing. For some reason he decided to see if the host was still even around, and sure enough, they were. Amazingly, even after nearly 6 years of inactivity his account was still there and active, his site was not defaced, and all the info is still there.
16:13 | brayden | God damn.. very interesting to read.
16:13 | * | brayden has learned the value of saving data!
16:16 | brayden | lol and now he's tried to transfer it to one of his dedicated machines. Says the RAM/CPU has gone insane and he has to reboot the box.
18:22 | shaqfu | So, the guy with all the computer mags just got back to me
18:22 | shaqfu | He has a brief list of the mags; anyone interested?
18:26 | godane | hey shaqfu
18:26 | shaqfu | godane: Yo
18:26 | godane | i'm still downloading kat.ph community
18:26 | godane | close to a 300MB .warc.gz
18:27 | shaqfu | Awesome
18:27 | godane | do you know much about grep?
18:27 | shaqfu | What are you trying to grep?
18:28 | godane | all photobucket.com images in my kat.ph dump
18:28 | shaqfu | Shouldn't those be picked up by --page-requisites? Or did you disable --span-hosts?
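For comparison, a wget invocation roughly like the following would have pulled the photobucket images in as page requisites while keeping the crawl itself on kat.ph; the exact flag set is a guess, not what godane actually ran, and the --domains list is what stops --span-hosts from wandering across the whole web (the worry raised just below):

    wget --recursive --level=inf --no-parent \
         --page-requisites --span-hosts --domains=kat.ph,photobucket.com \
         --wait=1 --warc-file=kat-ph-community \
         http://kat.ph/community/

Note that listing photobucket.com in --domains also lets wget follow ordinary links into that site, so in practice one would probably tighten it further.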
18:28 | godane | i didn't have that
18:29 | shaqfu | Oh, hm
18:29 | shaqfu | Should be reasonable to grep out photobucket.com and scrape out the URLs
18:29 | godane | i also wouldn't want it to start going after the whole internet
18:29 | shaqfu | You have to limit its levels
18:29 | shaqfu | Probably to 1
18:31 | godane | since there are images everywhere
18:32 | godane | also i was trying to only get stuff from kat.ph/community
18:33 | godane | i added a --wait 1 so this took a very long time
18:33 | shaqfu | Ah
18:35 | godane | i had to because it failed on me before
18:57 | godane | there are also a lot of 404 errors
18:57 | alard | godane: Are the image urls absolute?
18:58 | alard | That would make the grepping much easier.
18:59 | godane | there are spaces in some
19:00 | alard | With absolute I meant that the url is not relative to the page it is on. E.g. ../../images/test.png is relative, http://photobucket.com/something.png or just /something.png is absolute.
19:00 | alard | Ah, now that I read your messages again: the pages are from kat.ph and they link to photobucket.com? Then the urls must be absolute.
19:01 | godane | yes, they're absolute
19:01 | alard | Then grepping is enough, since you don't need to know where the urls came from.
19:02 | alard | Let me see.
19:02 | alard | grep -ohP '<img[^>]+src="[^">]+"'
19:02 | alard | Does that produce something?
19:03 | balrog_ | alard: I have a simple feature request for wget
19:03 | balrog_ | it would be nice if it had an option which would make it query for indices in all directories
19:03 | balrog_ | so if an html file has pictures in /img, it should also query /img in case the httpd has indices turned on
19:03 | balrog_ | err, embedded pictures
19:03 | balrog_ | it should do that in general with all subdirs, with that option enabled
19:04 | balrog_ | do you follow? :)
19:04 | alard | balrog_: Yes. Have you tried it yourself? :)
19:05 | alard | Alternatively, since it is quite site-specific, you might want to use a Lua script that does this.
19:05 | balrog_ | is it site specific?
19:05 | balrog_ | have I tried to add that feature? not atm
19:05 | godane | holy crap
19:06 | godane | your code made a 31.3MB text file of image urls
19:06 | alard | balrog_: I'd say it is rather specific: some sites will have /index.html, others /. I don't think it's a problem that is general enough to need a special Wget option.
19:06 | alard | godane: You probably need a second grep to postprocess the output.
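A second pass roughly like this (filenames hypothetical) would reduce that output to bare photobucket URLs and feed them back to wget; matching everything up to the closing quote also tolerates the embedded spaces godane mentioned:

    grep -oE 'https?://[^"]*photobucket\.com/[^"]*' img-tags.txt | sort -u > photobucket-urls.txt
    wget --input-file=photobucket-urls.txt --wait=1 --warc-file=kat-ph-photobucket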
19:06 | balrog_ | uhh, I don't think you follow!
19:06 | godane | i think it got all of them too
19:06 | godane | i know
19:06 | balrog_ | what I meant was that it would be nice if wget could query for indices automatically
19:07 | balrog_ | even if said indices aren't linked from anywhere
19:07 | alard | balrog_: Yes, I think I do understand. If it finds /img/something.png you want it to request /img/ as well.
19:07 | balrog_ | yes
19:10 | alard | I think the fastest way to get that result is by writing a small Lua script.
19:11 | alard | That's what I would do, at least, if I were you.
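Short of writing that Lua hook, a crude shell approximation of balrog_'s idea is to take the list of URLs a finished crawl actually fetched, strip each back to its parent directory, and request those directories in a second pass in case the server has auto-indexing enabled; the file names here are hypothetical:

    # fetched-urls.txt: one URL per line from the first crawl
    sed 's|[^/]*$||' fetched-urls.txt | sort -u > dirs.txt
    wget --input-file=dirs.txt --recursive --level=1 --no-parent --warc-file=dir-indices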
21:12 | godane | hey shaqfu
21:47 | dashcloud | shaqfu: I thought someone else would have asked to see the list of magazines by now, but in any event, I would be interested in the list
21:50 | godane | i got the kastatic.com images now