04:33 <ArtimusAg> Not sure if off-topic, but if this is relevant to current projects: I've started personally archiving VGMusic.com's MIDIs and organizing them per game, since there doesn't seem to be a proper archive of its contents
04:40 <chronomex> cool
06:13 <Vito``> tef: from my experience running private bookmarking and caching/archiving services (we donated the CSS parsing code to wget), you increasingly need a "real browser" to do a good job of caching/archiving a page/site.
06:14 <Vito``> tef: I actually work off of three different "archives" for any site we cache: we take a screenshot, we cache it with wget, and we're working on caching a static representation as captured from within the browser
06:14 <Vito``> none can feed back into wayback machine yet, but it's on the to-do list
06:29 <Coderjoe> that's the curse of "Web 2.0" designs :-\
06:47 <instence> The biggest problem right now is sites that use AJAX calls to dynamically load data. wget doesn't have a JavaScript interpreter, so when it hits pages like this it just sees an empty DIV and goes nowhere.
06:49 <Vito``> yeah, I expect to completely replace wget with phantomjs at some point
06:50 <Vito``> well, except for single-file mirroring, like a PDF or something
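A minimal sketch of the headless-browser approach Vito`` is describing, not his actual setup: it writes a small PhantomJS helper (save_page.js and the example URL are made up for illustration) that loads a page, lets its JavaScript and AJAX calls run, then saves the rendered DOM and a screenshot.

cat > save_page.js <<'EOF'
// Load a page in PhantomJS, let its JavaScript run, then save the rendered DOM and a screenshot.
var page = require('webpage').create();
var fs = require('fs');
var system = require('system');
var url = system.args[1];

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('failed to load ' + url);
        phantom.exit(1);
    }
    // give AJAX-loaded content a moment to arrive before capturing
    setTimeout(function () {
        fs.write('page.html', page.content, 'w');  // DOM as the browser sees it after scripts ran
        page.render('page.png');                   // screenshot of the rendered page
        phantom.exit(0);
    }, 3000);
});
EOF
phantomjs save_page.js 'http://example.com/some-ajax-heavy-page'

This is roughly why a "real browser" capture sees content that wget never does: the HTML wget saves is the pre-JavaScript page, while page.content here is the DOM after the scripts have filled in those empty DIVs.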
09:43 <godane> chronomex: if vgmusic.com is just MIDI files then I may look at archiving it so we have a full warc.gz of it
09:51 <BlueMaxim> godane: it pretty much is, to my knowledge
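For reference, a rough sketch of the kind of wget WARC mirror godane is talking about; the flags are standard wget, but the exact invocation is an assumption, not what he actually ran.

# recursive mirror of vgmusic.com, page requisites included, everything recorded into a WARC
wget --mirror --page-requisites --adjust-extension \
     --warc-file=vgmusic.com-$(date +%Y%m%d) --warc-cdx \
     -o wget.log http://www.vgmusic.com/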
13:11 <hiker1> I am trying to download a site that uses assets on a subdomain. I used --span-hosts and --domains, but now it's making a duplicate copy of the site under the www. hostname. I set -D to include tinypic.com so that it would download hotlinked images, but it seems to have downloaded some of tinypic's web pages too.
13:20 <ersi> AFAIK the images from tinypic are hosted on a subdomain
13:20 <ersi> like i.tinypic.com or something like that
13:23 <hiker1> yes
13:24 <hiker1> But how do I tell it not to access ^tinypic.com and only access *.tinypic.com?
13:25 <hiker1> --domains and --exclude-domains don't appear to accept wildcards or regexes
13:30 <schbirid1> correct
13:49 <hiker1> How can I avoid downloading from the wrong domain, then?
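One hedged workaround: -D/--domains does suffix matching rather than wildcards, so listing the concrete image host (e.g. i.tinypic.com) instead of the whole registrable domain keeps tinypic.com's own pages out of the crawl. example.com here stands in for the site actually being mirrored.

# span hosts, but only onto the listed hostnames; "i.tinypic.com" matches the image
# subdomain without also admitting tinypic.com or www.tinypic.com
wget --mirror --page-requisites --span-hosts \
     --domains=example.com,www.example.com,i.tinypic.com \
     http://example.com/

If the host names aren't predictable, newer wget builds (1.14 and later) also have --accept-regex/--reject-regex for filtering whole URLs, which may be the cleaner fix.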
17:54 <SketchCow> Hooray, Boxing Day
18:15 <Nemo_bis> SketchCow: would it be useful to email you a list of the magazines (searches/keywords) I uploaded, so that when you have time you can create collections, darken them, or do whatever you like with them?
18:29 <godane> SketchCow: I'm up to 2011.08.31 of Attack of the Show
18:29 <godane> also I'm uploading the vgmusic.com warc.gz right now
19:01 <hiker1> godane: Andriasang.com appears stable now, if you were still willing to try to grab a copy
19:08 <godane> I'm grabbing it
19:09 <godane> I grabbed the articles
19:09 <godane> but the images I will have to try next
19:12 <godane> uploaded: http://archive.org/details/vgmusic.com-20121226-mirror
19:48 <godane> hiker1: http://archive.org/details/andriasang.com-articles-20121224-mirror
19:58 <hiker1> godane: What commands did you use to mirror the site?
20:20 <godane> I made an index file first
20:20 <hiker1> using what command?
20:21 <godane> wget -x -i index.txt --warc-file=$website-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log
20:22 <godane> I had to do it this way because there were way too many images to mirror the whole thing
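The command again with the flags annotated; the annotations are added here, the command itself is godane's, and $website is assumed to hold the site name.

# -x               : force a directory hierarchy even for single-file downloads
# -i index.txt     : read the list of URLs to fetch from index.txt
# --warc-file=...  : record every request/response into a WARC named after the site and date
# --warc-cdx       : also write a CDX index alongside the WARC
# -E               : --adjust-extension, save text/html pages with an .html suffix
# -o wget.log      : write wget's log to a file instead of the terminal
wget -x -i index.txt --warc-file=$website-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log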
20:22
🔗
|
hiker1 |
You append http://andriasang.com to that command, right? |
20:23
🔗
|
godane |
its all from http://andriasang.com |
20:23
🔗
|
hiker1 |
And will that grab the html articles, just not the images? |
20:24
🔗
|
godane |
*I had to add http://andriasang.com to all urls since there local urls |
20:25
🔗
|
hiker1 |
I don't understand what you mean by that. How did you add it to all the urls? |
20:25
🔗
|
godane |
with sed |
20:26
🔗
|
hiker1 |
to start, that first command grabs all the html files, correct? |
20:26
🔗
|
godane |
when i may my index.txt file from a dump of the pages you get urls without http like this: /?date=2007-11-05 |
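godane doesn't show the exact sed invocation; a minimal sketch that would do what he describes, turning the site-relative urls in index.txt into absolute ones, might look like this.

# prefix every site-relative URL (e.g. /?date=2007-11-05) with the host,
# giving http://andriasang.com/?date=2007-11-05
sed -i 's|^/|http://andriasang.com/|' index.txt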
20:26
🔗
|
hiker1 |
and ignores images because you did not use --page-requisites |
20:27
🔗
|
hiker1 |
Am I correct in saying that? |
20:28
🔗
|
godane |
i just grabbed what was listed in my index.txt |
20:28
🔗
|
godane |
there is one image in there |
20:28
🔗
|
godane |
http://andriasang.com/u/anoop/avatar_full.1351839050.jpg |
20:28
🔗
|
hiker1 |
Does running this command save html files, or just save an index? `wget -x -i index.txt --warc-file=$website-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log` |
20:29
🔗
|
godane |
it saves html files |
20:29
🔗
|
godane |
i got the index.txt file from another warc of the pages |
20:29
🔗
|
DFJustin |
ultraman cooking what |
20:30
🔗
|
hiker1 |
godane: Could you explain that? How did you get the index file to begin with? |
20:32
🔗
|
godane |
i think i grabed it by: zcat *.warc.gz | grep -ohP 'href='[^'>]+' |
20:33
🔗
|
godane |
i did this to my pages warc.gz |
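Written out fully, that extraction step might look like the sketch below. It assumes the article links are single-quoted hrefs, as in godane's pattern, and that pages-*.warc.gz is the page-listing WARC; neither detail is confirmed in the log.

# pull every single-quoted href out of the page-listing WARCs, strip the
# href='...' wrapper, de-duplicate, and save the result as the new URL list
zcat pages-*.warc.gz \
  | grep -ohP "href='[^'>]+'" \
  | sed -e "s/^href='//" -e "s/'$//" \
  | sort -u > index.txt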
20:33
🔗
|
hiker1 |
How'd you get the warc.gz to begin with? |
20:34
🔗
|
godane |
for i in $(seq 1 895); do |
20:34
🔗
|
godane |
echo "http://andriasang.com/?page=$i" >> index.txt |
20:34
🔗
|
godane |
done |
20:36
🔗
|
hiker1 |
So that gives you a list of all the pages. How then did you get the warc.gz/index.txt with the full urls and with the urls by date? |
20:36
🔗
|
godane |
i then did this: wget -x -i index.txt --warc-file=andrisasang.com-$(date +%Y%m%d) --warc-cdx -E -o wget.log |
20:39 <hiker1> So you end up downloading the page listings twice in this process?
20:39 <hiker1> the first time to get all the urls, then the second time to get the real warc file with all the articles?
20:39 <godane> no
20:40 <godane> the first time it was the pages
20:40 <godane> then all the urls of the articles
20:40 <hiker1> Did you then merge the two together?
20:40 <godane> the dates and pages would also be in the articles dump too
20:41 <hiker1> oh, ok
20:43 <hiker1> How do you plan to get the images?
20:43
🔗
|
godane |
by grabing the urls like how i grabed the images |
20:44
🔗
|
hiker1 |
Will you then be able to merge the two warc files so that the images can be viewed in the articles? |
20:45
🔗
|
godane |
the way back machine can handler multiable warcs |
20:45
🔗
|
hiker1 |
Can you use the wayback machine to read these from the web? Or do you mean by running a private copy of the wayback machine? |
20:46
🔗
|
godane |
you can use warc-proxy to do it locally |
20:46
🔗
|
hiker1 |
and just load both warc files from that? |
20:46
🔗
|
godane |
yes |
20:48
🔗
|
hiker1 |
Thank you for explaining this to me. I was having a hard time understand the process. I really appreciate the help. |
22:14
🔗
|
hiker1 |
godane: How do you handle grabbing CSS or images embedded in CSS? |
22:28
🔗
|
godane |
i sadly don't know how to grab stuff in css |
22:28
🔗
|
godane |
even with wget |
22:28
🔗
|
godane |
cause i don't know if wget grabs urls in css |
22:30
🔗
|
Nemo_bis |
the requisites option maybe? |
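For what it's worth, wget 1.12 and later do parse CSS and follow url(...) references inside downloaded stylesheets when page requisites are requested, so something along these lines should pick up the CSS and the images it pulls in. This is a sketch built on the same index.txt approach, not a tested recipe, and the WARC name is made up.

# -p / --page-requisites fetches the CSS, scripts and images needed to render each page,
# including images referenced from inside the downloaded stylesheets
wget -x -i index.txt -p -E \
     --warc-file=andriasang.com-requisites-$(date +%Y%m%d) --warc-cdx \
     -o wget-requisites.log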
22:32
🔗
|
godane |
i can't grab the full website in one warc |
22:39
🔗
|
hiker1 |
Why can't you? |
22:40
🔗
|
godane |
it was 2.8gb big and was still going when i was doing it the first time |
22:40
🔗
|
hiker1 |
is that too large for one wget? |
22:41
🔗
|
godane |
4gb is the limit on one warc.gz |
22:41
🔗
|
godane |
it was getting there and it bothered me |
22:41
🔗
|
hiker1 |
oh. |
22:43
🔗
|
godane |
there is over 317000+ images in that site |
22:44 <ersi> that's a few
22:45 <chronomex> yeah that'll add up
22:46 <hiker1> wow
22:46 <godane> I may have to do another grab of the images later
22:46 <hiker1> What do you mean?
22:46 <godane> there were a lot of images that had no folder/url path in them
22:47 <godane> it was just the file name
22:47 <hiker1> I thought you were only grabbing html files right now
22:47 <godane> the html was already done
22:47 <godane> http://archive.org/details/andriasang.com-articles-20121224-mirror
22:48 <godane> that's the html articles
22:48 <godane> there were about 30 articles that gave a 502 Bad Gateway error
22:49 <godane> I was only able to get 4 of them on a retry
22:49 <godane> I limit the warc.gz file size to 1G
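If the worry is a single oversized warc.gz, wget can also cap and split the WARC output itself with --warc-max-size. A sketch assuming the same index.txt approach; the file names are illustrative, not godane's.

# --warc-max-size starts a new, numbered WARC once the current one reaches the limit,
# so a large crawl ends up spread over several warc.gz files instead of one huge one
wget -x -i index.txt -E \
     --warc-file=andriasang.com-images-$(date +%Y%m%d) --warc-cdx \
     --warc-max-size=1073741824 \
     -o wget-images.log
# 1073741824 bytes = 1 GiB per WARC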