[04:33] Not sure if off-topic, but if this is relevant to current projects, I began personally archiving VGMusic.com's MIDIs and organizing them per game, since there doesn't seem to be a proper archive of its contents
[04:40] cool
[06:13] tef: from my experience running private bookmarking and caching/archiving services (we donated the CSS parsing code to wget), you increasingly need a "real browser" to do a good job of caching/archiving a page/site.
[06:14] tef: I actually work off of three different "archives" for any site we cache: we take a screenshot, we cache it with wget, and we're working on caching a static representation as captured from within the browser
[06:14] none can feed back into the Wayback Machine yet, but it's on the to-do list
[06:29] that's the curse of "Web 2.0" designs :-\
[06:47] The biggest problem right now is sites that use AJAX calls to dynamically load data. wget doesn't have a JavaScript engine, and when it hits pages like this, it just sees an empty DIV and goes nowhere.
[06:49] yeah, I expect to completely replace wget with phantomjs at some point
[06:50] well, except for single-file mirroring, like a PDF or something
[09:43] chronomex: if vgmusic.com is just midi files then i may look at archiving it so we have a full warc.gz of it
[09:51] godane: it pretty much is, to my knowledge
[13:11] I am trying to download a site that uses assets on a subdomain. I used --span-hosts and --domains, but now it's making a duplicate copy of the site for the www. domain. I set -D to include tinypic.com so that it would download hotlinked images, but it seems to have downloaded some of the web pages from tinypic too.
[13:20] AFAIK the images from tinypic are hosted on a subdomain
[13:20] like i.tinypic.com or something like that
[13:23] yes
[13:24] But how do I tell it to not access ^tinypic.com and only access *.tinypic.com?
[13:25] --domains and --exclude-domains don't appear to accept wildcards or regex
[13:30] correct
[13:49] How can I avoid downloading from the wrong domain then?
[17:54] Hooray, Boxing Day
[18:15] SketchCow: would it be useful to email you a list of magazines (searches/keywords) I uploaded so that when you have time you can create collections/darken/do whatever you like with them?
[18:29] SketchCow: I'm up to 2011.08.31 of Attack of the Show
[18:29] also i'm uploading the vgmusic.com warc.gz right now
[19:01] godane: Andriasang.com appears stable now, if you were still willing to try to grab a copy
[19:08] i'm grabbing it
[19:09] i grabbed the articles
[19:09] but the images i will have to try next
[19:12] uploaded: http://archive.org/details/vgmusic.com-20121226-mirror
[19:48] hiker1: http://archive.org/details/andriasang.com-articles-20121224-mirror
[19:58] godane: What commands did you use to mirror the site?
[20:20] i made an index file first
[20:20] using what command?
[20:21] wget -x -i index.txt --warc-file=$website-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log
[20:22] i had to do it this way cause there were way too many images to mirror it whole
[20:22] You append http://andriasang.com to that command, right?
[20:23] it's all from http://andriasang.com
[20:23] And will that grab the html articles, just not the images?
[20:24] *I had to add http://andriasang.com to all the urls since they're local urls
[20:25] I don't understand what you mean by that. How did you add it to all the urls?
[20:25] with sed
[20:26] to start, that first command grabs all the html files, correct?
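A minimal sketch of that sed step, assuming the list pulled from the pages warc has already been cleaned down to bare site-relative paths like /?date=2007-11-05 (raw-urls.txt is just a placeholder name):
  # prepend the site root so wget -x -i can fetch each article as an absolute url
  sed 's|^/|http://andriasang.com/|' raw-urls.txt | sort -u > index.txt
The result would be the index.txt fed to the wget command quoted at 20:21.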
[20:26] when i made my index.txt file from a dump of the pages, you get urls without http, like this: /?date=2007-11-05
[20:26] and ignores images because you did not use --page-requisites
[20:27] Am I correct in saying that?
[20:28] i just grabbed what was listed in my index.txt
[20:28] there is one image in there
[20:28] http://andriasang.com/u/anoop/avatar_full.1351839050.jpg
[20:28] Does running this command save html files, or just save an index? `wget -x -i index.txt --warc-file=$website-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log`
[20:29] it saves html files
[20:29] i got the index.txt file from another warc of the pages
[20:29] ultraman cooking what
[20:30] godane: Could you explain that? How did you get the index file to begin with?
[20:32] i think i grabbed it by: zcat *.warc.gz | grep -ohP "href='[^'>]+'"
[20:33] i did this to my pages warc.gz
[20:33] How'd you get the warc.gz to begin with?
[20:34] for i in $(seq 1 895); do
[20:34] echo "http://andriasang.com/?page=$i" >> index.txt
[20:34] done
[20:36] So that gives you a list of all the pages. How then did you get the warc.gz/index.txt with the full urls and with the urls by date?
[20:36] i then did this: wget -x -i index.txt --warc-file=andriasang.com-$(date +%Y%m%d) --warc-cdx -E -o wget.log
[20:39] So you end up downloading the page listings twice in this process?
[20:39] the first time to get all the urls, then the second time to get the real warc file with all the articles?
[20:39] no
[20:39] first time it was pages
[20:40] then all the urls of articles
[20:40] Did you then merge the two together?
[20:40] the dates and pages would also be in the articles dump too
[20:40] oh, ok
[20:41] How do you plan to get the images?
[20:43] by grabbing the image urls the same way i grabbed the articles
[20:44] Will you then be able to merge the two warc files so that the images can be viewed in the articles?
[20:45] the wayback machine can handle multiple warcs
[20:45] Can you use the wayback machine to read these from the web? Or do you mean by running a private copy of the wayback machine?
[20:46] you can use warc-proxy to do it locally
[20:46] and just load both warc files from that?
[20:46] yes
[20:48] Thank you for explaining this to me. I was having a hard time understanding the process. I really appreciate the help.
[22:14] godane: How do you handle grabbing CSS or images embedded in CSS?
[22:28] i sadly don't know how to grab stuff in css
[22:28] even with wget
[22:28] cause i don't know if wget grabs urls in css
[22:30] the requisites option maybe?
[22:32] i can't grab the full website in one warc
[22:39] Why can't you?
[22:40] it was 2.8gb big and was still going when i was doing it the first time
[22:40] is that too large for one wget?
[22:41] 4gb is the limit on one warc.gz
[22:41] it was getting there and it bothered me
[22:41] oh.
[22:43] there are over 317,000 images on that site
[22:44] that's a few
[22:44] yeah that'll add up
[22:45] wow
[22:46] i may have to do another grab later of the images
[22:46] What do you mean?
[22:46] there were a lot of images that had no folder/url path in them
[22:46] it was just the file name
[22:47] I thought you were only grabbing html files right now
[22:47] html was already done
[22:47] http://archive.org/details/andriasang.com-articles-20121224-mirror
[22:47] that's the html articles
[22:48] there were about 30 articles that gave a 502 Bad Gateway error
[22:48] i was only able to get 4 of them on a retry
[22:49] i limit the warc.gz file size to 1G
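A sketch of how that later image pass could look, assuming the absolute image urls are first pulled back out of the existing articles warc (image-urls.txt and the jpg/png/gif extension list are assumptions); wget's --warc-max-size option splits the output into numbered warc.gz files so each stays under the chosen size:
  # pull absolute image urls out of the articles warc; images referenced only by a
  # bare file name with no path, as mentioned above, would still need separate handling
  zcat andriasang.com-articles-*.warc.gz | grep -ohP "http://andriasang\.com/\S+\.(jpg|png|gif)" | sort -u > image-urls.txt
  # fetch them into warc.gz files capped at 1G apiece
  wget -x -i image-urls.txt --warc-file=andriasang.com-images-$(date +%Y%m%d) --warc-max-size=1G --warc-cdx -E -o wget-images.log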