#archiveteam-bs 2013-10-13,Sun


Time Nickname Message
10:17 🔗 instence what do you guys do when trying to archive sites that could have say 500,000+ pages, or even over 1,000,000 in the case of a large forum?
10:18 🔗 instence wget uses quite a bit of memory when doing recursive retrievals, or anything with -E or -p turned on
10:19 🔗 instence I shouldn't say quite a bit. Rather, it uses the appropriate amount to get the job done, which just happens to become larger and larger when dealing with big sites.
10:20 🔗 omf_ I use httrack for large sites. It has much better memory management
10:20 🔗 instence really, hmm
10:20 🔗 instence the documentation was really poor for httrack last time i checked
10:21 🔗 instence or rather it wasn't nearly as explanatory as wget
10:21 🔗 instence a member of the community had written the doc, rather than the actual author of the app
10:22 🔗 instence I have always avoided forums in my website archives since... they are just too big and would butcher my drives. However lately I might archive some, and just package them up on the server and not do any post processing work on them.
10:23 🔗 instence Whatever I get is what I get
10:23 🔗 instence the only issue though is when the forum grab starts getting duplicate stuff, like hitting jump links to individual posts in a phpBB2 forum.
10:24 🔗 omf_ that is where the more advanced filtering in httrack comes in
10:25 🔗 instence ah ok cool, I will have to take another stab at fully learning httrack when I decide to start hitting some big forums
10:26 🔗 omf_ you can do domain, subdomain, file format, directory depth and regular expression matching with no limitations on how many rules you create
10:27 🔗 instence one thing I have been loving lately is using a RAM disc to extract content for post processing and packaging
10:28 🔗 instence ah yea I could have some use for that level of granularity, especially with sites that are like user.domain.com, and domain.com/user/, where the admin hard-linked content in his HTML from either one
10:28 🔗 instence the domain scoping in wget is just -D
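[Editor's sketch] The duplicate-grab problem with phpBB2 jump links mentioned above can be illustrated with a small URL filter applied before fetching. This is only a sketch under stated assumptions, not how wget or httrack filter internally: it assumes per-post jump links look like `viewtopic.php?p=<id>#<id>`, that session IDs travel in a `sid` query parameter, and the `forum.example` host and all function names are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: session IDs are carried in a "sid" query parameter.
SESSION_PARAMS = {"sid"}


def canonicalize(url):
    """Drop the fragment and session parameters, and sort what is left."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in SESSION_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))


def is_post_jump_link(url):
    """True for viewtopic.php links that address a single post (p=...)
    instead of a topic (t=...). Assumed URL shape, see the note above."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    return parts.path.endswith("viewtopic.php") and "p" in params and "t" not in params


def filter_urls(urls):
    """Return canonical URLs in first-seen order, without jump links or repeats."""
    seen = set()
    kept = []
    for url in urls:
        if is_post_jump_link(url):
            continue
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


if __name__ == "__main__":
    queue = [
        "http://forum.example/viewtopic.php?t=5&sid=abc",
        "http://forum.example/viewtopic.php?t=5&sid=def",
        "http://forum.example/viewtopic.php?p=77#77",
        "http://forum.example/viewtopic.php?t=5&start=15",
    ]
    for url in filter_urls(queue):
        print(url)
```

With the sample queue, the two `sid` variants collapse into one topic URL and the `p=77` jump link is skipped, so only two pages would be fetched.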
15:44 🔗 SmileyG joepie91: I want a MP10 powerhead and controller... only £200+!
15:46 🔗 SmileyG errr wrong channel and person!
15:53 🔗 joepie91 :P
15:53 🔗 joepie91 SmileyG: classy
16:15 🔗 dashcloud so, got a question for anyone else who has had an SSD die on them: did you get any kind of warning, or know it was dying before it died?
18:09 🔗 Lord_Nigh dashcloud: never had one die myself. supposedly wear leveling on the intel ones is supposed to make them go read-only when they'd run out of spare sectors, but i don't know if that actually works or they lose the remap table sectors first
18:09 🔗 Lord_Nigh which kills the ssd
18:09 🔗 Lord_Nigh or more specifically is like losing the fat of a filesystem; the data is all there you just have no idea what order it's supposed to be in
18:27 🔗 dashcloud the first notice I had that something was wrong was turning the laptop on, and wondering why it's sitting at the logo screen for so long
18:36 🔗 SmileyG heh
18:36 🔗 SmileyG sucks, hope you got backup and this is why I don't trust ssd's yet.
18:37 🔗 omf_ All hard drives fail and this is why frequent backups are necessary
18:37 🔗 SmileyG yes but spinning rust has a generally well known failure style
18:37 🔗 SmileyG unless you hit a power spike, or punch your PC.
18:39 🔗 omf_ That said, for some hard drive failures checking the SMART data frequently can clue you in to failures
18:39 🔗 omf_ SMART does not catch all problems but it is far better than what we used to have
18:41 🔗 omf_ You still have the beginning of the bath tub curve failures which usually go undetected till they happen
18:54 🔗 instence SmileyG, I have a VAIO Z, 3rd gen with all the trimmings. It is a powerhouse laptop, with a quad core i7 (desktop power, not low voltage cpu), 8GB of RAM, 1080p display that has 98% Adobe RGB color gamut reproduction. The Power Media Dock that it connects to has a Radeon 7670M, USB 3.0 ports, can handle 4 connected displays, and the thing is so light it would blow your mind.
18:55 🔗 SmileyG And?
18:55 🔗 instence The SSD is proprietary Sony NAND Flash memory in a Raid 0 config, and might even be soldered to the motherboard.
18:55 🔗 SmileyG I run gentoo and boot in 3 seconds
18:55 🔗 SmileyG :D
18:55 🔗 instence If that SSD dies... it's a very very expensive brick.
18:57 🔗 dashcloud it's actually light? I'd imagine a desktop replacement like that would weigh a considerable amount
18:57 🔗 instence So I have been taking every precaution to minimize writes to the SSD itself. I have been trying to treat it as read only as possible, and push everything off to an external 2TB USB 3.0 HD, as well as using a 2GB RAM disc.
18:57 🔗 dashcloud the good news about the SSD is it's still under warranty, so I'll get a replacement- still sucks having it die suddenly, and needing to reinstall everything
18:57 🔗 instence It's 2.5 lbs
19:01 🔗 instence http://www.mobiletechreview.com/notebooks/Sony-Vaio-Z-2012.htm
19:02 🔗 instence dashcloud: yea if you can replace the drive, then that is great, it would be crummy if the laptop ended up becoming unusable
19:03 🔗 dashcloud running off of a live USB drive right now. Found Youtube's html5 player pretty good (better than I expected)
19:04 🔗 instence If you can, carve out a section of your RAM Disk and use that to move tmp/temp dirs and partitions off the SSD, and also use it as scratch space to extract packages that might have thousands of files in them.
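[Editor's sketch] The scratch-space idea above (extracting many-file packages onto a RAM disk so the writes never reach the SSD) might look roughly like this. It is a sketch, not anyone's actual setup: the candidate mount points are assumptions (`/dev/shm` is a common RAM-backed mount on Linux, `/mnt/ramdisk` is a made-up path), the function names are illustrative, and the code falls back to the normal temp directory when no RAM disk is found.

```python
import os
import tarfile
import tempfile

# Assumption: possible RAM-backed mount points. Adjust for your own system.
RAM_DISK_CANDIDATES = ("/dev/shm", "/mnt/ramdisk")


def pick_scratch_root(candidates=RAM_DISK_CANDIDATES):
    """Return the first writable candidate directory, else the default temp dir."""
    for path in candidates:
        if os.path.isdir(path) and os.access(path, os.W_OK):
            return path
    return tempfile.gettempdir()


def extract_to_scratch(archive_path, candidates=RAM_DISK_CANDIDATES):
    """Unpack a tar archive into a fresh directory under the scratch root.

    Only use this on archives you trust; extractall() writes whatever
    paths the archive contains.
    """
    scratch = tempfile.mkdtemp(prefix="extract-", dir=pick_scratch_root(candidates))
    with tarfile.open(archive_path) as archive:
        archive.extractall(scratch)
    return scratch
```

The caller is responsible for deleting the returned directory (for example with `shutil.rmtree`) once post processing is done, since RAM disk space is small.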
19:11 🔗 instence So far I am mostly experienced with optimizing Windows 7 for minimizing SSD writes, and I have taken it really far. Moving everything from tmp/browser cache, to RDP Bitmap cache, killing office recent files to even killing all types of other unnecessary writes like Beyond Compare's BCState.xml.tmp.
19:12 🔗 instence If you have PowerISO running with no disc in the drive, it writes over 3,000 log entries per day telling you it can't find a disk in the drive lol
19:12 🔗 instence But, I have started looking up some stuff for linux, and here is a good starting point:
19:12 🔗 instence http://superuser.com/questions/228657/which-linux-filesystem-works-best-with-ssd
19:13 🔗 instence I still need to find more links, but that url is pretty meaty
19:18 🔗 instence and SmileyG: 3 sec boot is awesome :D nice
19:36 🔗 yipdw dashcloud: no warning for me, but the SSD isn't actually dead
19:36 🔗 yipdw dashcloud: it just pops in and out occasionally -- I suspect it's a controller problem
19:36 🔗 yipdw I've only had problems with OCZ drives :P
19:36 🔗 yipdw the Intel X25-Ms I've had for about three years now are still going fine
19:37 🔗 omf_ how big a drive yipdw ?
19:38 🔗 yipdw omf_: 240 GB
19:39 🔗 yipdw I use it for ephemeral VMs and a Steam installation
19:39 🔗 yipdw so it was a surprising non-event when it went :P
19:39 🔗 yipdw was like "huh, ok" *reboot* "oh there it is"
19:40 🔗 omf_ how many years did it last?
19:40 🔗 yipdw less than one, though it's still working
19:40 🔗 yipdw the X25-Ms just passed three
19:41 🔗 omf_ I am trying to figure out the ideal size
19:41 🔗 yipdw I've been okay with 64 GB drives
19:41 🔗 yipdw that's on my laptop, though, which mostly hosts source code
19:42 🔗 yipdw the desktop has two 80 GB SSDs as well as that 240 GB
22:28 🔗 godane looks like i'm grabbing old articles of dailymail.co.uk that are really from femail.co.uk
22:28 🔗 godane even the id number of the article is same
22:54 🔗 Baljem yes, I think Femail is the Daily Mail's women's supplement
22:54 🔗 Baljem or some such bullshit. I try and avoid the Mail as much as possible, lest my brains start dribbling out my ears
22:55 🔗 Sellyme I don't mind reading Daily Mail articles, because I have AdBlock enabled on their site, so I'm costing them money.
22:56 🔗 Sellyme Additionally, they can serve as reliable news.
22:56 🔗 Sellyme Just assume that the opposite of whatever they say is true, and bam, reliable news
22:57 🔗 godane i'm only going after the first 100000 articles
22:58 🔗 godane there are over 2.5 million article ids to check
22:58 🔗 godane and i don't want to do that much
22:59 🔗 godane so the first 199 episodes of destructoid is uploaded
23:00 🔗 godane i'm downloading the 2xx episodes right now
23:00 🔗 godane also geekbrief tv is going to get uploaded
23:01 🔗 godane i decided to use the basename of the video files
23:02 🔗 Coderjoe one of my bosses is of the opinion that the Daily Fail is more truthful than mainstream. :-\
23:03 🔗 Coderjoe (he's also been (still is?) a truther. and it seems he's going down the conspiracy hole.)
23:03 🔗 godane also out of 20000 ids there are only about 5500 that are real articles on the site
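[Editor's sketch] The numbers quoted here (about 5500 real articles per 20000 ids, a 100000-id target, over 2.5 million ids in total) allow a rough back-of-the-envelope projection. The big assumption, which the log does not confirm, is that the hit rate stays the same across the whole id range; the function name is illustrative.

```python
def estimate_real_articles(sample_hits, sample_size, ids_to_check):
    """Scale the observed hit rate to a larger id range.

    Assumes the sampled hit rate holds across the whole range.
    """
    hit_rate = sample_hits / sample_size
    expected = round(sample_hits * ids_to_check / sample_size)
    return hit_rate, expected


rate, first_100k = estimate_real_articles(5500, 20000, 100_000)
_, full_range = estimate_real_articles(5500, 20000, 2_500_000)

print(f"hit rate: {rate:.1%}")                         # hit rate: 27.5%
print(f"expected in first 100000 ids: {first_100k}")   # 27500
print(f"expected in 2.5 million ids: {full_range}")    # 687500
```

So under that assumption the 100000-id grab would yield roughly 27500 articles, against something like 687500 for the full range.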
23:04 🔗 godane truthers are a real nutty group
23:06 🔗 godane also i think the truthers got their theory from a failed x-files spin off
23:15 🔗 Aranje signs of an empire in decline imo
23:15 🔗 godane i'm thinking the same thing with revision3
23:16 🔗 godane trying to grab like everything that i can from it
