#archiveteam 2012-12-26,Wed


Time Nickname Message
04:33 🔗 ArtimusAg Not sure if off-topic, but if this is relevant to current projects, I began personally archiving VGMusic.com's MIDIs and organizing them per game, since there seems to be no proper archive of its contents
04:40 🔗 chronomex cool
06:13 🔗 Vito`` tef: from my experience running private bookmarking and caching/archiving services (we donated the CSS parsing code to wget), you increasingly need a "real browser" to do a good job of caching/archiving a page/site.
06:14 🔗 Vito`` tef: I actually work off of three different "archives" for any site we cache: we take a screenshot, we cache it with wget, and we're working on caching a static representation as captured from within the browser
06:14 🔗 Vito`` none can feed back into wayback machine yet, but it's on the to-do list
06:29 🔗 Coderjoe that's the curse of "Web 2.0" designs :-\
06:47 🔗 instence The biggest problem right now is sites that use AJAX calls to dynamically load data. wget doesn't have a javascript engine, and when it hits pages like this, it just sees an empty DIV and goes nowhere.
06:49 🔗 Vito`` yeah, I expect to completely replace wget with phantomjs at some point
06:50 🔗 Vito`` well, except for single-file mirroring, like a PDF or something
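A minimal sketch of the phantomjs approach Vito`` describes, with assumed file names and an example URL (not a command from the log): it loads a JavaScript-heavy page in a headless browser, waits for AJAX content, and saves both a screenshot and the post-JavaScript DOM, which is roughly what wget on its own cannot do.

# write a small phantomjs script (snapshot.js is an assumed name), then run it
cat > snapshot.js <<'EOF'
var page = require('webpage').create();
var url  = require('system').args[1];
page.open(url, function (status) {
    // wait a few seconds so AJAX-loaded content has a chance to appear
    setTimeout(function () {
        page.render('snapshot.png');                              // screenshot
        require('fs').write('snapshot.html', page.content, 'w');  // rendered DOM
        phantom.exit(status === 'success' ? 0 : 1);
    }, 3000);
});
EOF
phantomjs snapshot.js http://example.com/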
09:43 🔗 godane chronomex: if vgmusic.com is just midi files then i may look at archiving it so we have a full warc.gz of it
09:51 🔗 BlueMaxim godane it pretty much is, to my knowledge
13:11 🔗 hiker1 I am trying to download a site that uses assets on a subdomain. I used --span-hosts and --domains, but now it's making a duplicate copy of the site under the www. subdomain. I set -D to include tinypic.com so that it would download hotlinked images, but it seems to have downloaded some of the web pages from tinypic too.
13:20 🔗 ersi AFAIK the images from tinypic are hosted on a subdomain
13:20 🔗 ersi like i.tinypic.com or something like that
13:23 🔗 hiker1 yes
13:24 🔗 hiker1 But how do I tell it to not access ^tinypic.com and only access *.tinypic.com?
13:25 🔗 hiker1 --domains and --exclude-domains don't appear to accept wildcards or regex
13:30 🔗 schbirid1 correct
13:49 🔗 hiker1 How can I avoid downloading from the wrong domain then?
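One workaround, sketched with assumed hostnames (example.com stands in for the site hiker1 is mirroring, i.tinypic.com is the image host ersi guessed above): wget matches --domains and --exclude-domains entries as hostname suffixes rather than wildcards, so naming the image subdomain directly keeps wget off tinypic's own pages, and excluding the www host avoids the duplicate copy.

# hostnames are assumptions; -D/--exclude-domains take comma-separated suffix matches
wget --mirror --page-requisites --span-hosts \
     --domains=example.com,i.tinypic.com \
     --exclude-domains=www.example.com \
     http://example.com/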
17:54 🔗 SketchCow Hooray, Boxing Day
18:15 🔗 Nemo_bis SketchCow: would it be useful to email you a list of magazines (searches/keywords) I uploaded so that when you have time you can create collections/darken/do whatever you like with them?
18:29 🔗 godane SketchCow: I'm up to 2011.08.31 of attack of the show
18:29 🔗 godane also i'm uploading vgmusic.com warc.gz right now
19:01 🔗 hiker1 godane: Andriasang.com appears stable now, if you were still willing to try to grab a copy
19:08 🔗 godane i'm grabbing it
19:09 🔗 godane i grabbed the articles
19:09 🔗 godane but the images i will have to try next
19:12 🔗 godane uploaded: http://archive.org/details/vgmusic.com-20121226-mirror
19:48 🔗 godane hiker1: http://archive.org/details/andriasang.com-articles-20121224-mirror
19:58 🔗 hiker1 godane: What commands did you use to mirror the site?
20:20 🔗 godane i made an index file first
20:20 🔗 hiker1 using what command?
20:21 🔗 godane wget -x -i index.txt --warc-file=$website-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log
20:22 🔗 godane i had to do it this way cause there were way too many images to mirror it whole
20:22 🔗 hiker1 You append http://andriasang.com to that command, right?
20:23 🔗 godane it's all from http://andriasang.com
20:23 🔗 hiker1 And will that grab the html articles, just not the images?
20:24 🔗 godane *I had to add http://andriasang.com to all urls since they're local urls
20:25 🔗 hiker1 I don't understand what you mean by that. How did you add it to all the urls?
20:25 🔗 godane with sed
20:26 🔗 hiker1 to start, that first command grabs all the html files, correct?
20:26 🔗 godane when i made my index.txt file from a dump of the pages, you get urls without http, like this: /?date=2007-11-05
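A minimal sketch of the sed step godane mentions above (the exact expression is an assumption, not quoted from the log): it turns the site-relative urls in index.txt into absolute ones.

# prefix site-relative urls such as /?date=2007-11-05 with the site's hostname
sed -i 's|^/|http://andriasang.com/|' index.txt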
20:26 🔗 hiker1 and ignores images because you did not use --page-requisites
20:27 🔗 hiker1 Am I correct in saying that?
20:28 🔗 godane i just grabbed what was listed in my index.txt
20:28 🔗 godane there is one image in there
20:28 🔗 godane http://andriasang.com/u/anoop/avatar_full.1351839050.jpg
20:28 🔗 hiker1 Does running this command save html files, or just save an index? `wget -x -i index.txt --warc-file=$website-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log`
20:29 🔗 godane it saves html files
20:29 🔗 godane i got the index.txt file from another warc of the pages
20:29 🔗 DFJustin ultraman cooking what
20:30 🔗 hiker1 godane: Could you explain that? How did you get the index file to begin with?
20:32 🔗 godane i think i grabbed it by: zcat *.warc.gz | grep -ohP "href='[^'>]+"
20:33 🔗 godane i did this to my pages warc.gz
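Put together, the link-extraction step godane describes might look like this (the warc filename and exact pattern are assumptions): pull href targets out of the pages warc, strip the href=' prefix, and deduplicate the result into index.txt.

zcat andriasang.com-*.warc.gz \
  | grep -ohP "href='[^'>]+" \
  | sed "s/^href='//" \
  | sort -u > index.txt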
20:33 🔗 hiker1 How'd you get the warc.gz to begin with?
20:34 🔗 godane for i in $(seq 1 895); do
20:34 🔗 godane echo "http://andriasang.com/?page=$i" >> index.txt
20:34 🔗 godane done
20:36 🔗 hiker1 So that gives you a list of all the pages. How then did you get the warc.gz/index.txt with the full urls and with the urls by date?
20:36 🔗 godane i then did this: wget -x -i index.txt --warc-file=andriasang.com-$(date +%Y%m%d) --warc-cdx -E -o wget.log
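For readers unfamiliar with the short flags, the same invocation with wget's long option names (the flag mapping is standard wget, not something godane spelled out):

#   -x = --force-directories    recreate the site's directory layout on disk
#   -i = --input-file           read the list of urls to fetch from index.txt
#   -E = --adjust-extension     save html pages with an .html suffix
#   -o = --output-file          send wget's log to a file instead of the terminal
wget --force-directories --input-file=index.txt \
     --warc-file=andriasang.com-$(date +%Y%m%d) --warc-cdx \
     --adjust-extension --output-file=wget.log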
20:39 🔗 hiker1 So you end up downloading the page listings twice in this process?
20:39 🔗 hiker1 the first time to get all the urls, then the second time to get the real warc file with all the articles?
20:39 🔗 godane no
20:39 🔗 godane first time it was pages
20:40 🔗 godane then all urls of articles
20:40 🔗 hiker1 Did you then merge the two together?
20:40 🔗 godane the dates and pages would also be in the articles dump too
20:40 🔗 hiker1 oh, ok
20:41 🔗 hiker1 How do you plan to get the images?
20:43 🔗 godane by grabbing the image urls the same way i grabbed the article urls
20:44 🔗 hiker1 Will you then be able to merge the two warc files so that the images can be viewed in the articles?
20:45 🔗 godane the wayback machine can handle multiple warcs
20:45 🔗 hiker1 Can you use the wayback machine to read these from the web? Or do you mean by running a private copy of the wayback machine?
20:46 🔗 godane you can use warc-proxy to do it locally
20:46 🔗 hiker1 and just load both warc files from that?
20:46 🔗 godane yes
20:48 🔗 hiker1 Thank you for explaining this to me. I was having a hard time understanding the process. I really appreciate the help.
22:14 🔗 hiker1 godane: How do you handle grabbing CSS or images embedded in CSS?
22:28 🔗 godane i sadly don't know how to grab stuff in css
22:28 🔗 godane even with wget
22:28 🔗 godane cause i don't know if wget grabs urls in css
22:30 🔗 Nemo_bis the requisites option maybe?
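A hedged sketch of what Nemo_bis is suggesting, not a command from the log: adding --page-requisites to godane's earlier invocation asks wget to fetch each page's stylesheets, scripts and images, and wget builds with CSS parsing (the code Vito`` mentioned donating earlier) also follow url(...) references inside those stylesheets. The warc name is an assumption.

wget -x -i index.txt --page-requisites \
     --warc-file=andriasang.com-requisites-$(date +%Y%m%d) --warc-cdx \
     -E -o wget-requisites.log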
22:32 🔗 godane i can't grab the full website in one warc
22:39 🔗 hiker1 Why can't you?
22:40 🔗 godane it was 2.8gb big and was still going when i was doing it the first time
22:40 🔗 hiker1 is that too large for one wget?
22:41 🔗 godane 4gb is the limit on one warc.gz
22:41 🔗 godane it was getting there and it bothered me
22:41 🔗 hiker1 oh.
22:43 🔗 godane there are over 317,000 images on that site
22:44 🔗 ersi that's a few
22:44 🔗 chronomex yeah that'll add up
22:45 🔗 hiker1 wow
22:46 🔗 godane i may have to do another grab later of images
22:46 🔗 hiker1 What do you mean?
22:46 🔗 godane there were a lot of images that had no folder/url path in them
22:46 🔗 godane it was just the file name
22:47 🔗 hiker1 I thought you were only grabbing html files right now
22:47 🔗 godane html was already done
22:47 🔗 godane http://archive.org/details/andriasang.com-articles-20121224-mirror
22:47 🔗 godane thats the html articles
22:47 🔗 godane there were about 30 articles that gave the 502 bad gateway error
22:48 🔗 godane i was only able to get 4 of them on a retry
22:49 🔗 godane i limit the warc.gz file size to 1G
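A hedged sketch of the size cap godane describes (the input list and warc name are assumptions): wget's --warc-max-size starts a new, serially numbered warc.gz once the current one reaches the limit, keeping each file well under the 4GB ceiling mentioned above.

# 1073741824 bytes = 1 GiB per warc.gz before wget rolls over to the next file
wget -x -i image-urls.txt --warc-file=andriasang.com-images-$(date +%Y%m%d) \
     --warc-max-size=1073741824 --warc-cdx -E -o wget-images.log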
