Time |
Nickname |
Message |
10:17
🔗
|
instence |
what do you guys do when trying to archive sites that could have say 500,000+ pages, or even over 1,000,000 in the case of a large forum? |
10:18
🔗
|
instence |
wget uses quite a bit of memory when doing recursive retrievals, or anything with -E or -p turned on |
10:19
🔗
|
instence |
I shouldn't say quite a bit. Rather, it uses the appropriate amount to get the job done, which just happens to become larger and larger when dealing with big sites. |
10:20
🔗
|
omf_ |
I use httrack for large sites. It has much better memory management |
10:20
🔗
|
instence |
really, hmm |
10:20
🔗
|
instence |
the documentation was really poor for httrack last time i checked |
10:21
🔗
|
instence |
or rather it wasn't nearly as explanatory as wget |
10:21
🔗
|
instence |
a member of the community had written the doc, rather than the actual author of the app |
10:22
🔗
|
instence |
I have always avoided forums in my website archives since... they are just too big and would butcher my drives. However lately I might archive some, and just package them up on the server and not do any post processing work on them. |
10:23
🔗
|
instence |
Whatever I get is what I get |
10:23
🔗
|
instence |
the only issue though is when the forum grab starts getting duplicate stuff, like hitting jump links to individual posts in a phpBB2 forum. |
10:24
🔗
|
omf_ |
that is where the more advanced filtering in httrack comes in |
10:25
🔗
|
instence |
ah ok cool, I will have to take another stab at fully learning httrack when I decide to start hitting some big forums |
10:26
🔗
|
omf_ |
you can do domain, subdomain, file format, directory depth and regular expression matching with no limitations on how many rules you create |
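A minimal sketch of the kind of rule set described above (domain/subdomain, file format, directory depth, and regular expressions), written as a plain Python scope check. This is not httrack's or wget's filter syntax. The domain `example.com`, the constants, and the function `in_scope` are all hypothetical, and the phpBB2 jump-link form `viewtopic.php?p=<post id>` is an assumption about how the forum builds its per-post links.

```python
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical rules for illustration only.
ALLOWED_DOMAIN = "example.com"               # domain/subdomain rule (placeholder)
SKIP_EXTENSIONS = (".zip", ".exe")           # file-format rule
MAX_DEPTH = 6                                # directory-depth rule
SKIP_PATTERNS = [re.compile(r"[?&]sid=")]    # regex rule: drop session-id URLs


def in_scope(url: str) -> bool:
    """Return True if the URL passes every rule above."""
    parts = urlparse(url)
    host = parts.hostname or ""
    # Accept the bare domain and any of its subdomains.
    if host != ALLOWED_DOMAIN and not host.endswith("." + ALLOWED_DOMAIN):
        return False
    path = parts.path.lower()
    if path.endswith(SKIP_EXTENSIONS):
        return False
    # Depth is the number of non-empty path segments.
    depth = len([seg for seg in parts.path.split("/") if seg])
    if depth > MAX_DEPTH:
        return False
    if any(pattern.search(url) for pattern in SKIP_PATTERNS):
        return False
    # Assumed phpBB2-style jump link to a single post: the topic page
    # is already fetched through its ?t=<topic id> form, so skip ?p=.
    if path.endswith("viewtopic.php") and "p" in parse_qs(parts.query):
        return False
    return True
```

With these placeholder rules, a topic URL such as `http://forum.example.com/viewtopic.php?t=42` stays in scope, while the per-post variant `viewtopic.php?p=1234#1234` is rejected as a duplicate of its topic page.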
10:27
🔗
|
instence |
one thing I have been loving lately is using a RAM disc to extract content for post processing and packaging |
10:28
🔗
|
instence |
ah yea I could have some use for that level of granularity, especially with sites that are like user.domain.com, and domain.com/user/, where the admin linked content in his HTML hard linked from either or |
10:28
🔗
|
instence |
the domain scoping in wget is just -D |
15:44
🔗
|
SmileyG |
joepie91: I want a MP10 powerhead and controller... only £200+! |
15:46
🔗
|
SmileyG |
errr wrong channel and person! |
15:53
🔗
|
joepie91 |
:P |
15:53
🔗
|
joepie91 |
SmileyG: classy |
16:15
🔗
|
dashcloud |
so, got a question for anyone else who has had an SSD die on them: did you get any kind of warning, or know it was dying before it died? |
18:09
🔗
|
Lord_Nigh |
dashcloud: never had one die myself. supposedly wear leveling on the intel ones is supposed to make them go read-only when they'd run out of spare sectors, but i don't know if that actually works or they lose the remap table sectors first |
18:09
🔗
|
Lord_Nigh |
which kills the ssd |
18:09
🔗
|
Lord_Nigh |
or more specifically is like losing the fat of a filesystem; the data is all there you just have no idea what order it's supposed to be in |
18:27
🔗
|
dashcloud |
the first notice I had that something was wrong was turning the laptop on, and wondering why it's sitting at the logo screen for so long |
18:36
🔗
|
SmileyG |
heh |
18:36
🔗
|
SmileyG |
sucks, hope you got backup and this is why I don't trust ssd's yet. |
18:37
🔗
|
omf_ |
All hard drives fail and this is why frequent backups are necessary |
18:37
🔗
|
SmileyG |
yes but spinning rust has a generally well known failure style |
18:37
🔗
|
SmileyG |
unless you hit a power spike, or punch your PC. |
18:39
🔗
|
omf_ |
That said, for some hard drive failures checking the smart settings frequently can clue you into failures |
18:39
🔗
|
omf_ |
smart does not catch all problems but it is far better than what we used to have |
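As a rough illustration of checking SMART data regularly, the sketch below parses text in the attribute-table layout that `smartctl -A` prints and flags a couple of counters commonly watched for early trouble. The sample values are made up, the choice of watched attributes is an assumption, and the names `parse_attributes` and `flagged` are hypothetical. As noted above, a clean result does not rule out a failure.

```python
# Sample text in the attribute-table layout of `smartctl -A`.
# The numbers are invented for illustration.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8212
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

# Assumption: non-zero raw values here are worth a closer look.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def parse_attributes(text):
    """Map attribute name -> integer raw value for each table row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            if raw.isdigit():
                attrs[fields[1]] = int(raw)
    return attrs


def flagged(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return [name for name in WATCHED if attrs.get(name, 0) > 0]


print(flagged(parse_attributes(SAMPLE)))
```

In practice the text would come from running `smartctl -A` against a real device on a schedule. The sample string only keeps the sketch self-contained.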
18:41
🔗
|
omf_ |
You still have the beginning of the bath tub curve failures which usually go undetected till they happen |
18:54
🔗
|
instence |
SmileyG, I have a VAIO Z, 3rd gen with all the trimmings. It is a powerhouse laptop, with a quad core i7 (desktop power, not low voltage cpu), 8GB of RAM, 1080p display that has 98% Adobe RGB color gamut reproduction. The Power Media Dock that it connects to has a Radeon 7670M, USB 3.0 ports, can handle 4 connected displays, and the thing is so light it would blow your mind. |
18:55
🔗
|
SmileyG |
And? |
18:55
🔗
|
instence |
The SSD is proprietary Sony NAND Flash memory in a Raid 0 config, and might even be soldered to the motherboard. |
18:55
🔗
|
SmileyG |
I run gentoo and boot in 3 seconds |
18:55
🔗
|
SmileyG |
:D |
18:55
🔗
|
instence |
If that SSD dies... it's a very very expensive brick. |
18:57
🔗
|
dashcloud |
it's actually light? I'd imagine a desktop replacement like that would weigh a considerable amount |
18:57
🔗
|
instence |
So I have been taking every precaution to minimize writes to the SSD itself. I have been trying to treat it as read only as possible, and push everything off to an external 2TB USB 3.0 HD, as well as using a 2GB RAM disc. |
18:57
🔗
|
dashcloud |
the good news about the SSD is it's still under warranty, so I'll get a replacement- still sucks having it die suddenly, and needing to reinstall everything |
18:57
🔗
|
instence |
It's 2.5 lbs |
19:01
🔗
|
instence |
http://www.mobiletechreview.com/notebooks/Sony-Vaio-Z-2012.htm |
19:02
🔗
|
instence |
dashcloud: yea if you can replace the drive, then that is great, it would be crummy if the laptop ended up becoming unusable |
19:03
🔗
|
dashcloud |
running off of a live USB drive right now. Found Youtube's html5 player pretty good (better than I expected) |
19:04
🔗
|
instence |
If you can, carve out a section of your RAM Disk and use that to move tmp/temp dirs and partitions off the SSD, and also use it as scratch space to extract packages that might have thousands of files in them. |
19:11
🔗
|
instence |
So far I am mostly experienced with optimizing Windows 7 for minimizing SSD writes, and I have taken it really far. Moving everything from tmp/browser cache, to RDP Bitmap cache, killing office recent files to even killing all types of other unnecessary writes like Beyond Compare's BCState.xml.tmp. |
19:12
🔗
|
instence |
If you have PowerISO running with no disc in the drive, it writes over 3,000 log entries per day telling you it can't find a disk in the drive lol |
19:12
🔗
|
instence |
But, I have started looking up some stuff for linux, and here is a good starting point: |
19:12
🔗
|
instence |
http://superuser.com/questions/228657/which-linux-filesystem-works-best-with-ssd |
19:13
🔗
|
instence |
I still need to find more links, but that url is pretty meaty |
19:18
🔗
|
instence |
and SmileyG: 3 sec boot is awesome :D nice |
19:36
🔗
|
yipdw |
dashcloud: no warning for me, but the SSD isn't actually dead |
19:36
🔗
|
yipdw |
dashcloud: it just pops in and out occasionally -- I suspect it's a controller problem |
19:36
🔗
|
yipdw |
I've only had problems with OCZ drives :P |
19:36
🔗
|
yipdw |
the Intel X25-Ms I've had for about three years now are still going fine |
19:37
🔗
|
omf_ |
how big a drive yipdw ? |
19:38
🔗
|
yipdw |
omf_: 240 GB |
19:39
🔗
|
yipdw |
I use it for ephemeral VMs and a Steam installation |
19:39
🔗
|
yipdw |
so it was a surprising non-event when it went :P |
19:39
🔗
|
yipdw |
was like "huh, ok" *reboot* "oh there it is" |
19:40
🔗
|
omf_ |
how many years did it last? |
19:40
🔗
|
yipdw |
less than one, though it's still working |
19:40
🔗
|
yipdw |
the X25-Ms just passed three |
19:41
🔗
|
omf_ |
I am trying to figure out the ideal size |
19:41
🔗
|
yipdw |
I've been okay with 64 GB drives |
19:41
🔗
|
yipdw |
that's on my laptop, though, which mostly hosts source code |
19:42
🔗
|
yipdw |
the desktop has two 80 GB SSDs as well as that 240 GB |
22:28
🔗
|
godane |
looks like i'm grabbing old articles of dailymail.co.uk that are really from femail.co.uk |
22:28
🔗
|
godane |
even the id number of the article is same |
22:54
🔗
|
Baljem |
yes, I think Femail is the Daily Mail's women's supplement |
22:54
🔗
|
Baljem |
or some such bullshit. I try and avoid the Mail as much as possible, lest my brains start dribbling out my ears |
22:55
🔗
|
Sellyme |
I don't mind reading Daily Mail articles, because I have AdBlock enabled on their site, so I'm costing them money. |
22:56
🔗
|
Sellyme |
Additionally, they can serve as reliable news. |
22:56
🔗
|
Sellyme |
Just assume that the opposite of whatever they say is true, and bam, reliable news |
22:57
🔗
|
godane |
i'm only going after the first 100000 articles |
22:58
🔗
|
godane |
there is over 2.5 million article ids to check |
22:58
🔗
|
godane |
and i don't want to do that much |
22:59
🔗
|
godane |
so the first 199 episodes of destructoid is uploaded |
23:00
🔗
|
godane |
i'm downloading the 2xx episodes right now |
23:00
🔗
|
godane |
also geekbrief tv is going to get uploaded |
23:01
🔗
|
godane |
i decided to use the basename of the video files |
23:02
🔗
|
Coderjoe |
one of my bosses is of the opinion that the Daily Fail is more truthful than mainstream. :-\ |
23:03
🔗
|
Coderjoe |
(he's also been (still is?) a truther. and it seems he's going down the conspiracy hole.) |
23:03
🔗
|
godane |
also out of 20000 ids there is only about 5500 that are real articles on the site |
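A back-of-the-envelope sketch of what that hit rate would imply, assuming (and this is only an assumption) that the roughly 5,500 real articles per 20,000 ids holds across the rest of the id range. The function name `estimate_real_articles` is hypothetical.

```python
def estimate_real_articles(sample_ids, sample_hits, total_ids):
    """Scale the observed hit rate up to a larger id range."""
    return round(sample_hits * total_ids / sample_ids)


# Figures from the log: about 5,500 real articles in 20,000 ids.
print(estimate_real_articles(20_000, 5_500, 100_000))    # first 100,000 ids
print(estimate_real_articles(20_000, 5_500, 2_500_000))  # all 2.5 million ids
```

Under that assumption the first 100,000 ids would hold about 27,500 articles, and the full 2.5 million ids about 687,500. Older and newer id ranges may well have different densities, so treat these as rough planning numbers.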
23:04
🔗
|
godane |
truthers are a real nutty group |
23:06
🔗
|
godane |
also i think the truthers got their theory from a failed x-files spin off |
23:15
🔗
|
Aranje |
signs of an empire in decline imo |
23:15
🔗
|
godane |
i'm thinking the same thing with revision3 |
23:16
🔗
|
godane |
trying to grab like everything that i can from it |