00:31 <ryonaloli> anyone have advice on scraping a site from archive.org with wget?
00:32 <APerti> Can't you just get the WARCs for the site?
00:33 <ryonaloli> i have no idea what those are :/
00:33 <ryonaloli> i'm kinda new to this
01:53 <SketchCow> What are you trying to get?
02:53 <zenguy_pc> hi
02:54 <zenguy_pc> i was thinking about archiving fsrn.org, i saw that it's creative commons licensed.. is that being archived already?
02:57 <giganticp> wat is sekrit word
02:57 <zenguy_pc> ?
02:57 <balrog> The secret word is "yahoosucks"
02:58 <zenguy_pc> https://clbin.com/mgntE
02:58 <zenguy_pc> if anyone is interested
02:58 <zenguy_pc> didn't grab it yet
04:55 <Jonimus> trying to emulate old games which had even basic copy protection sucks.
04:57 <APerti> Which games/copy protection?
05:04 <Jonimus> nvm, it turns out it was a "have you read the manual" check
05:04 <Jonimus> and of course Archive.org has a scan of the manual ;D
05:08 <APerti> Nice.
06:25 <ryonaloli> <@SketchCow> What are you trying to get?
06:25 <ryonaloli> an imageboard called gurochan which died weeks ago. i'm part of the team that created a new one, but we need the original's archives
06:28 <ryonaloli> https://web.archive.org/web/20140106164316/http://gurochan.net/ (link is sfw, following links will be nsfw)
07:09 <SketchCow> OK, so you want to pull from the Internet Archive WARCs
07:10 <SketchCow> https://archive.org/details/gurochan_archive_2006-2010
07:10 <SketchCow> (Obviously not perfect, I just happened to notice this)
07:12 <ryonaloli> SketchCow: that's only the images with unix timestamps, and we already have those. what we need is the original thread structure
07:12 <SketchCow> Anyway, what I'm seeing here is that archive.org has semi-irregular grabs.
07:12 <ryonaloli> how do i use WARCs? i looked it up but could only find descriptions of the format, not how to create one
07:13 <SketchCow> But it's probably what you're looking for.
07:13 <SketchCow> You can probably yank from the Wayback.
07:14 <ryonaloli> i'm not sure how to use them to take from wayback
07:16 <SketchCow> http://waybackdownloader.com/
07:16 <SketchCow> Maybe
07:16 <SketchCow> I'm looking for utilities.
07:17 <ryonaloli> >pricing and order form
07:18 <SketchCow> Yes.
07:19 <SketchCow> $15, might not be bad.
07:19 <ryonaloli> we're already on a tight budget to run the current site. we can't afford to spend $15 when there's probably a way to do it even with a firefox macro
07:19 <SketchCow> Otherwise, write a script and scrape like crazy.
07:20 <SketchCow> Sounds like you have it all under control. Good luck.
07:20 <ryonaloli> it requires javascript to view pages though, right?
07:20 <SketchCow> No idea.
07:21 <ryonaloli> hm
07:22 <ryonaloli> it seems the web archive tries its hardest to make scraping impossible
07:33 <ryonaloli> heh, that site's faq links all 404
07:48 <yipdw> you can check out https://github.com/alard/warc-proxy
07:48 <yipdw> it's a tool which reads WARCs and reconstructs HTTP responses from those WARCs
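The record layout yipdw is describing can be sketched with the standard library alone. This is an illustration of the WARC response-record format, not warc-proxy's actual code, and the sample record below is fabricated for the demo:

```python
# A WARC "response" record stores the raw HTTP response after its WARC
# headers, separated by blank lines. Splitting on those blank lines
# recovers the HTTP status line, headers, and body -- which is roughly
# what a WARC-reading tool has to do before replaying a capture.
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://gurochan.net/\r\n"
    b"Content-Type: application/http; msgtype=response\r\n"
    b"Content-Length: 75\r\n"
    b"\r\n"
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>hello</body></html>"
)

def split_record(record: bytes):
    """Split one WARC response record into (warc_headers, http_headers, body)."""
    warc_block, http_message = record.split(b"\r\n\r\n", 1)
    http_headers, body = http_message.split(b"\r\n\r\n", 1)
    warc = dict(
        line.split(b": ", 1)
        for line in warc_block.split(b"\r\n")[1:]  # skip the WARC/1.0 version line
    )
    return warc, http_headers, body

warc, http_headers, body = split_record(record)
print(warc[b"WARC-Target-URI"].decode())  # http://gurochan.net/
print(body.decode())                      # <html><body>hello</body></html>
```

Real WARCs are usually gzipped and carry more header fields, so for anything beyond a sketch a proper reader (warc-proxy, or a WARC library) is the right tool.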
07:49 <ryonaloli> but how do i create a warc?
07:49 <midas> but remember kids, just because it's an archive file doesn't make it a backup.
07:49 <midas> ryonaloli: wget has a special flag for that
07:50 <yipdw> you can also use wpull --warc-file
07:51 <yipdw> if there's a bunch of WARCs in a tarball, you can use https://github.com/ArchiveTeam/megawarc
07:52 <yipdw> I'm not sure why you need to create a WARC to retrieve thread structure from (some hypothetical) WARC, though
07:52 <ryonaloli> i'm still not sure how to turn a wayback link into a warc
07:52 <yipdw> oh, the Wayback Machine's WARCs aren't publicly accessible
07:52 <yipdw> well, most of them aren't, but that's not an important detail
07:53 <midas> neither are we ryonaloli, making a warc from wayback would only recreate the wayback http response
07:53 <ryonaloli> heh
07:53 <midas> besides that, grabbing all of the wayback machine might fill your drive up pretty fast
07:53 <ryonaloli> then what would be the best way to scrape a site without paying $15?
07:53 <ryonaloli> all? nah, just a website with <10 gigs
07:53 <midas> pay 15 bucks.
07:53 <midas> just pay the 15 bucks.
07:55 <yipdw> you could write your own scraper
07:55 <ryonaloli> i'm not sure how i'd write it if archive.org tries its best to block those. as for the $15, this is for a site with a very low budget
07:55 <yipdw> looking at gurochan.net captures it doesn't seem like it'd be all that difficult
07:55 <yipdw> eh?
07:55 <yipdw> I've never been blocked from downloading on any archive.org subdomain
07:55 <yipdw> what gave you the impression that you'd be blocked?
07:55 <yipdw> I mean, okay, maybe if you consume a ridiculous proportion of their bandwidth
07:56 <yipdw> but you don't need to do that
07:56 <ryonaloli> oh, i looked it up and most answers said it requires javascript to get internal links
07:56 <yipdw> what does, Wayback?
07:58 <ryonaloli> i think so
07:58 <yipdw> I don't know what that means
07:59 <yipdw> I can access any archived URL on gurochan with curl
07:59 <yipdw> e.g. $ curl -vvv 'http://web.archive.org/web/20100611210558/http://gurochan.net/dis/res/1109.html' works
07:59 <ryonaloli> hm, i'll probably have to try again then
08:00 <yipdw> I don't know where you read that accessing Wayback either (a) results in bans or (b) requires Javascript
08:14 <yipdw> wherever you read that is wrong
08:14 <ryonaloli> when i try "wget -np -e robots=off --mirror --domains=staticweb.archive.org,web.archive.org 'https://web.archive.org/web/20140106164316/http://gurochan.net/'", only the main page is downloaded; it doesn't go into any other links that don't have '20140106164316'
08:14 <ryonaloli> how do i let it go recursively into the rest without it trying to archive all of archive.org?
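One way around the recursive-crawl problem (a suggestion of mine, not something raised in the channel) is the Wayback CDX API, which lists every capture matching a URL pattern so each page can be fetched directly instead of discovered by link-following. A minimal stdlib sketch of building such a query:

```python
from urllib.parse import urlencode

# Sketch: ask the Wayback CDX API for one row per distinct captured URL
# under a domain, skipping redirects and errors. The resulting list can
# then be fetched capture-by-capture, with no recursion into archive.org.
def cdx_query(domain: str) -> str:
    """Build a Wayback CDX API query listing captures for a whole domain."""
    params = {
        "url": domain + "/*",        # everything under the domain
        "output": "json",
        "filter": "statuscode:200",  # skip redirects and errors
        "collapse": "urlkey",        # one row per distinct URL
    }
    return "http://web.archive.org/cdx/search/cdx?" + urlencode(params)

print(cdx_query("gurochan.net"))
```

Fetching that URL returns JSON rows of (urlkey, timestamp, original URL, ...), which is exactly the inventory a scraper needs.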
09:22 <midas> I still don't understand what you're trying to do, but i'd start with getting this warc file: https://archive.org/details/gurochan_archive_2006-2010
09:23 <midas> grab the warc-proxy and start working from there.
09:23 <midas> FYI, warc proxy has a readme.
09:23 <ryonaloli> i already have that file. midas: what i'm trying to do is retrieve the threads from the archive. the wget command doesn't seem to recursively follow links
09:24 <midas> that's because you're trying to use the wayback machine; it's not made for doing that
09:24 <midas> warc proxy + that warc file should be enough to get you going
09:24 <ryonaloli> that's the only thing i can use though. that 2006-2010 archive is just a bunch of images, not the threads or the original filenames
09:41 <midas> ryonaloli: wget -e robots=off --mirror --domains=staticweb.archive.org,web.archive.org https://web.archive.org/web/20140207233054/http://gurochan.net/
09:42 <midas> grabs it all, good luck getting it into something useful, can't help you with that
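A sketch of the "something useful" step midas leaves open: every page mirrored that way lands under a web.archive.org/web/<timestamp>/ prefix, and a small regex recovers the original URL (the optional letter suffix handles wayback's modifier paths such as im_; treat that detail as an assumption):

```python
import re

# A mirrored wayback page is stored as
#   web.archive.org/web/<14-digit timestamp>[modifier]/<original-url>
# so the original URL and its capture timestamp can be pulled back out.
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/(\d{14})(?:[a-z_]+)?/(.*)$"
)

def unwrap(wayback_url: str):
    """Return (timestamp, original_url) for a wayback capture URL, else None."""
    m = WAYBACK_RE.match(wayback_url)
    if not m:
        return None
    return m.group(1), m.group(2)

print(unwrap("https://web.archive.org/web/20140106164316/http://gurochan.net/"))
# ('20140106164316', 'http://gurochan.net/')
```

Running this over the mirrored file tree gives a map from capture to original URL, which is the starting point for rebuilding thread structure.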
09:44 <ryonaloli> midas: does that also grab every previous version?
09:45 <midas> everything is everything ryonaloli
09:46 <ryonaloli> damn, that's gotta be hundreds of gb
09:46 <midas> ..
09:47 <nico> probably not
09:52 <ryonaloli> but there are over a hundred snapshots, and the whole site is 7gb iirc
09:52 <midas> well, use the warc proxy.
09:52 <midas> now you're getting all of the snapshots, all of them
09:52 <midas> that's what you wanted.
10:02 <ryonaloli> how will the warc proxy be different?
10:02 <ryonaloli> and, i didn't want all of the snapshots. just the most recent one for each page
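Keeping only the newest capture per page, which is what ryonaloli actually wants, is a small reduction over CDX-style rows. The sample rows below are illustrative, not real CDX output:

```python
# Given rows of (urlkey, timestamp, original_url), keep only the newest
# capture per page. Wayback timestamps are fixed-width YYYYMMDDhhmmss
# strings, so plain string comparison orders them correctly.
def latest_per_page(rows):
    """Keep the row with the highest timestamp for each urlkey."""
    best = {}
    for urlkey, timestamp, original in rows:
        if urlkey not in best or timestamp > best[urlkey][0]:
            best[urlkey] = (timestamp, original)
    return best

rows = [
    ("net,gurochan)/", "20100611210558", "http://gurochan.net/"),
    ("net,gurochan)/", "20140106164316", "http://gurochan.net/"),
    ("net,gurochan)/dis/res/1109.html", "20100611210558",
     "http://gurochan.net/dis/res/1109.html"),
]
print(latest_per_page(rows))
```

This turns "over a hundred snapshots" into one download per page, far less than the full mirror.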
11:51 <fexx> any plans to grab 800notes.com / other phone number indexing sites?
11:53 <ersi> None that I know of, but anyone can do what they please. If it's interesting, feel free to take 'em on
12:02 <schbirid> https://pay.reddit.com/r/opendirectories/comments/25002s/meta_a_tool_for_tree_mapping_remote_directories/
12:04 <schbirid> not very useful output, http://dirmap.krakissi.net/?path=https%3A%2F%2Fwww.quaddicted.com%2Ffiles%2Fmaps%2F
12:05 <midas> so it has an open dir and loops it to find all files
12:06 <schbirid> wget --spider -nv and some regexping is more suitable for people like us
12:07 <midas> with the strange twitch of downloading everything
12:08 <schbirid> it does not download everything
12:11 <midas> spider doesn't, but people like us do
12:11 <midas> ;-)
12:11 <schbirid> >:)
14:15 <DFJustin> wow you guys fail at reading comprehension
14:15 <DFJustin> what he needs is https://code.google.com/p/warrick/ but he's gone now, naturally
14:38 <SketchCow> There you go.
14:38 <SketchCow> The $15 thing didn't inspire me to keep going.
14:53 <midas> SketchCow: next time: http://archiveteam.org/index.php?title=Restoring (will add more data tonight, mostly made by DFJustin now)
14:53 <midas> aka, all made by him atm :p
14:56 <SketchCow> Yeah, then we won't have to break someone's back suggesting $15
14:58 <midas> well, we can just point
15:36 <balrog> SketchCow: it doesn't inspire me either
15:37 <DFJustin> it's a recurring problem so it's worth documenting
15:41 <SketchCow> Agreed, absolutely.
15:43 <balrog> I'd put a disclaimer saying we don't endorse that paid service though
15:45 <DFJustin> you know what they say about wikis
15:46 <SketchCow> Everybody's got one
15:46 <DFJustin> that too
15:53 <SketchCow> Using the internetarchive python interface.
15:53 <SketchCow> Hardcore.
15:53 <SketchCow> Running into bugs and limits, so you know I'm being cruel
16:34 <midas> badass.py
16:48 <SketchCow> https://archive.org/details/gg_Aerial_Assault_Rev_1_1992_Sega
16:48 <SketchCow> Title, year and creator added by script. Cover and screenshot also.
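SketchCow's metadata script (built on the internetarchive Python library) isn't shown here. As a stdlib-only sketch of the read side, any item's metadata is publicly served from archive.org/metadata/<identifier>; the JSON below is illustrative, not a real API response:

```python
import json

# Every archive.org item exposes its metadata as JSON at
# https://archive.org/metadata/<identifier>. This builds that URL and
# pulls a few fields from a hand-written sample response.
def metadata_url(identifier: str) -> str:
    return "https://archive.org/metadata/" + identifier

sample = json.loads("""
{"metadata": {"identifier": "gg_Aerial_Assault_Rev_1_1992_Sega",
              "title": "Aerial Assault (Rev 1)",
              "year": "1992",
              "creator": "Sega"}}
""")

meta = sample["metadata"]
print(metadata_url(meta["identifier"]))
print(meta["title"], meta["year"], meta["creator"])
```

Writing metadata, as the script does, goes through the authenticated internetarchive library rather than this read-only endpoint.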
23:21 <ivan`> http://dealbook.nytimes.com/2014/05/08/delicious-social-site-is-sold-by-youtube-founders/