#archiveteam-ot 2019-12-19,Thu


Time Nickname Message
00:07 🔗 girst has quit IRC (Remote host closed the connection)
01:07 🔗 bithippo has quit IRC (Textual IRC Client: www.textualapp.com)
01:41 🔗 synm0nger has quit IRC (Quit: Wait, what?)
01:57 🔗 SynMonger has joined #archiveteam-ot
02:11 🔗 DogsRNice has quit IRC (Read error: Connection reset by peer)
02:12 🔗 cerca has joined #archiveteam-ot
03:19 🔗 icedice has joined #archiveteam-ot
03:21 🔗 jamiew has joined #archiveteam-ot
03:30 🔗 SoraUta has joined #archiveteam-ot
03:37 🔗 jamiew has quit IRC (Textual IRC Client: www.textualapp.com)
03:42 🔗 cerca has quit IRC (Remote host closed the connection)
04:01 🔗 icedice has quit IRC (Read error: Operation timed out)
04:51 🔗 qw3rty has joined #archiveteam-ot
05:00 🔗 qw3rty2 has quit IRC (Ping timeout: 745 seconds)
05:14 🔗 markedL has joined #archiveteam-ot
05:15 🔗 markedL do people burn in new hard drives before putting data on them?
05:17 🔗 kpcyrd I don't, now I'm wondering if I should
05:27 🔗 Frogging I run badblocks
05:28 🔗 Frogging It writes to every byte on the device and then reads the whole device
05:29 🔗 godane has quit IRC (Ping timeout: 745 seconds)
05:31 🔗 Frogging ("writing to every byte" is an oversimplification because of sectors and all that, but you get the idea)
05:32 🔗 Frogging I don't know if there's any point to doing so, but I don't see any reason not to. If the drive fails after a few passes of that, you've saved yourself some trouble later on.
05:33 🔗 Frogging it'll also find any bad sectors; that's actually what badblocks is designed to do
05:34 🔗 Frogging I'm sure they do that at the factory already, but still, why not do it again? Consider it a pre-formatting scrub
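A destructive badblocks pass along the lines Frogging describes would look roughly like this (a minimal sketch; /dev/sdX is a placeholder for the new, empty drive, and the -w write-mode test erases everything on it):

    # write four test patterns across the whole device, reading each back to verify
    sudo badblocks -wsv -b 4096 /dev/sdX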
05:40 🔗 godane has joined #archiveteam-ot
05:51 🔗 tuluu has quit IRC (Remote host closed the connection)
05:52 🔗 tuluu has joined #archiveteam-ot
06:26 🔗 Flashfire has quit IRC (Read error: Connection reset by peer)
06:27 🔗 Flashfire has joined #archiveteam-ot
07:08 🔗 deevious has joined #archiveteam-ot
07:12 🔗 dhyan_nat has joined #archiveteam-ot
08:11 🔗 deevious has quit IRC (Quit: deevious)
08:15 🔗 deevious has joined #archiveteam-ot
08:34 🔗 ShellyRol has quit IRC (Read error: Connection reset by peer)
08:37 🔗 bluefoo has quit IRC (Read error: Operation timed out)
08:51 🔗 ShellyRol has joined #archiveteam-ot
09:04 🔗 SoraUta has quit IRC (Ping timeout: 610 seconds)
09:39 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
09:54 🔗 Laverne has quit IRC (Ping timeout: 258 seconds)
09:54 🔗 mls has quit IRC (Ping timeout: 258 seconds)
09:54 🔗 VoynichCr has quit IRC (Ping timeout: 258 seconds)
09:54 🔗 sHATNER has quit IRC (Ping timeout: 258 seconds)
09:55 🔗 eythian has quit IRC (Ping timeout: 258 seconds)
09:55 🔗 luckcolor has quit IRC (Ping timeout: 258 seconds)
09:55 🔗 luckcolor has joined #archiveteam-ot
09:57 🔗 eythian has joined #archiveteam-ot
09:59 🔗 mls has joined #archiveteam-ot
10:00 🔗 sHATNER has joined #archiveteam-ot
10:26 🔗 deevious has quit IRC (Quit: deevious)
10:31 🔗 BlueMaxim has joined #archiveteam-ot
10:43 🔗 BlueMax has quit IRC (Ping timeout: 745 seconds)
10:49 🔗 deevious has joined #archiveteam-ot
10:59 🔗 VoynichCr has joined #archiveteam-ot
11:00 🔗 Laverne has joined #archiveteam-ot
12:23 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
12:47 🔗 bluefoo has joined #archiveteam-ot
13:20 🔗 jamiew has joined #archiveteam-ot
13:21 🔗 SoraUta has joined #archiveteam-ot
13:24 🔗 jamiew has quit IRC (Client Quit)
13:43 🔗 JAA Yeah, I also do essentially that (with SMART long test if available), and I also use fio to stress-test the mechanics for a few hours.
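A rough sketch of that combination, assuming smartmontools and fio are installed and /dev/sdX is the new drive (the fio parameters here are illustrative, not JAA's actual settings):

    sudo smartctl -t long /dev/sdX     # start the drive's built-in long self-test
    sudo smartctl -a /dev/sdX          # check the result and SMART attributes afterwards
    # a few hours of random reads to exercise the mechanics
    sudo fio --name=burnin --filename=/dev/sdX --rw=randread --bs=4k \
        --direct=1 --time_based --runtime=14400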
13:55 🔗 tuluu has quit IRC (Read error: Connection refused)
13:55 🔗 tuluu has joined #archiveteam-ot
14:45 🔗 bluefoo has quit IRC (Ping timeout: 744 seconds)
14:50 🔗 bluefoo has joined #archiveteam-ot
15:07 🔗 girst has joined #archiveteam-ot
15:15 🔗 deevious has quit IRC (Quit: deevious)
15:38 🔗 mc2 has quit IRC (Read error: Operation timed out)
15:58 🔗 dhyan_nat has joined #archiveteam-ot
16:41 🔗 jamiew has joined #archiveteam-ot
17:01 🔗 jamiew has quit IRC (Textual IRC Client: www.textualapp.com)
17:34 🔗 jamiew has joined #archiveteam-ot
17:56 🔗 ivan markedL: I do a write-read test using my http://github.com/ludios/drive-checker which has never actually caught a problem on a new drive for me (they tend to be tested before they get shipped out)
17:56 🔗 ivan but it does catch memory problems on the computer I use it on :-)
17:58 🔗 markedL I guess I'm equally concerned about damage from shipping or improper storage. The drives likely worked when they left the factory.
18:00 🔗 ivan that will show up as 'very DOA'
18:01 🔗 ivan noises, fail to spin up
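For comparison, a bare-bones write-then-read check can be put together from standard tools (an untested sketch, not ivan's drive-checker; /dev/sdX is a placeholder, the test destroys the drive's contents, and the >(...) syntax needs bash):

    SIZE=$(sudo blockdev --getsize64 /dev/sdX)
    # write a reproducible pseudorandom stream over the whole device, hashing it on the way in
    openssl enc -aes-128-ctr -pass pass:burnin -nosalt </dev/zero 2>/dev/null \
        | head -c "$SIZE" | tee >(sha256sum >expected.sha256) \
        | sudo dd of=/dev/sdX bs=4M iflag=fullblock conv=fsync status=progress
    # read the device back and compare; a mismatch points at the drive, RAM, or cabling
    sudo dd if=/dev/sdX bs=4M status=progress | sha256sum >actual.sha256
    diff expected.sha256 actual.sha256 && echo "read back matches what was written"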
18:09 🔗 qw3rty has quit IRC (Ping timeout: 745 seconds)
18:13 🔗 qw3rty has joined #archiveteam-ot
18:20 🔗 VerifiedJ has joined #archiveteam-ot
18:29 🔗 X-Scale` has joined #archiveteam-ot
18:29 🔗 LowLevelM has quit IRC (Read error: Operation timed out)
18:30 🔗 LowLevelM has joined #archiveteam-ot
18:34 🔗 X-Scale has quit IRC (Ping timeout: 610 seconds)
18:34 🔗 X-Scale` is now known as X-Scale
20:19 🔗 jamiew_ has joined #archiveteam-ot
20:20 🔗 MilkGames has joined #archiveteam-ot
20:20 🔗 jamiew_ has quit IRC (Client Quit)
20:21 🔗 MilkGames Hey there, how would I go about getting a web archive moved to the ArchiveTeam collection?
20:33 🔗 MilkGames Sorry, just realised this is the wrong channel to ask that in. I'll ask in another.
20:33 🔗 MilkGames has left
20:43 🔗 DogsRNice has joined #archiveteam-ot
21:06 🔗 jamiew_ has joined #archiveteam-ot
21:10 🔗 oxguy3 has joined #archiveteam-ot
21:16 🔗 oxguy3 uh hey, so i've got 204GB of gzipped WARC files from an FTP site... is there anything i should know before i attempt to upload this to archive.org with the ia command line tool? i've never uploaded anything remotely this big
21:25 🔗 astrid you might want to split it into a handful of distinct ia items, depending. up to you though.
21:25 🔗 astrid how big is each warc?
21:27 🔗 oxguy3 i set wget to target 1GB file size, but most are a bit bigger, and some are huge -- got a 12gb and an 11gb
21:27 🔗 astrid oh that's reasonable
21:28 🔗 astrid :)
21:29 🔗 oxguy3 yeah it's not too bad, and i figure it'd be best to keep them together so it's a singular package (i archived the entire FTP server lol)
21:29 🔗 oxguy3 i guess i'll get to uploading... this is gonna take a while lol
21:31 🔗 Kaz split to about 50gb/item ideally, I think is the common recommendation
21:33 🔗 oxguy3 i thought it was 50 gb/file and 1000 files/item? https://help.archive.org/hc/en-us/articles/360016475032-Uploading-Tips
21:37 🔗 oxguy3 oh wait, looks like items aren't supposed to be bigger than 100gb, shoot https://archive.org/services/docs/api/items.html#item-limitations
21:37 🔗 oxguy3 alright if i'm uploading this into multiple items, does it matter which item i put the meta warc file in?
21:40 🔗 Kaz yeah, don't make a 50tb item :)
21:40 🔗 Kaz I'm pretty sure it wouldn't be possible anyway, as I think all files for an item live together on a disk
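With the ia command-line tool, a split along those lines looks roughly like this (the identifiers, titles, and file name pattern are made up for illustration; wget's --warc-max-size produces numbered vikings-*.warc.gz pieces):

    # one item per chunk well under the ~100 GB guideline, each holding a batch of warc.gz files
    ia upload vikings-flashspot-ftp-part1 vikings-000*.warc.gz \
        --metadata="mediatype:web" \
        --metadata="title:vikings.flashspot.tv FTP mirror (part 1)"
    ia upload vikings-flashspot-ftp-part2 vikings-001*.warc.gz \
        --metadata="mediatype:web" \
        --metadata="title:vikings.flashspot.tv FTP mirror (part 2)"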
21:54 🔗 oxguy3 i'm assuming i should set mediatype:web since this is WARC files, even though it's ftp rather than http, right?
22:03 🔗 jamiew__ has joined #archiveteam-ot
22:08 🔗 SketchCow Also, unless you're an approved archive team project, it won't go into wayback
22:09 🔗 oxguy3 yeah i figured, but will it still be browseable on archive.org?
22:09 🔗 SketchCow The item? Sure
22:09 🔗 jamiew_ has quit IRC (Read error: Operation timed out)
22:09 🔗 oxguy3 like as in, would you be able to browse the full contents of the server in some easy way, instead of having to dig through warc.gz files?
22:09 🔗 SketchCow Nope
22:10 🔗 oxguy3 ah, hmm. would it be better if i just uploaded the actual raw files instead of the WARCs? (i didnt include --delete-after in my wget command so i have them raw as well)
22:10 🔗 SketchCow I don't know how more or not more better it is.
22:11 🔗 SketchCow What FTP site is it
22:11 🔗 SketchCow Dare you to say Intel
22:11 🔗 oxguy3 vikings.flashspot.tv -- the Minnesota Vikings used it to share video and photos with the press for many years
22:11 🔗 SketchCow Is it still up
22:12 🔗 oxguy3 yes, but hasn't been updated since 2017
22:12 🔗 SketchCow Just pass this to archivebot to do it
22:12 🔗 SketchCow Then it's all handled and it goes in wayback
22:13 🔗 oxguy3 it requires a login, i wasn't sure if that would be an issue
22:14 🔗 SketchCow Upload the raw files.
22:14 🔗 SketchCow How many files is it.
22:14 🔗 oxguy3 okay cool
22:14 🔗 oxguy3 uhhh, a lot... let me see
22:14 🔗 SketchCow Either way, it's going to be a nightmare
22:14 🔗 oxguy3 33462
22:15 🔗 SketchCow Yeah, raw files, have a ball
22:15 🔗 SketchCow Easiest if you upload it as a large set of .ZIP files
22:15 🔗 oxguy3 ah yeah, that sounds better than making 34+ items lol
22:16 🔗 SketchCow Let's put it this way, it's going to be awful no matter what.
22:16 🔗 SketchCow A few largish .zip files will do
22:16 🔗 astrid a .zip per toplevel folder maybe
22:17 🔗 SketchCow Yeah, not too many, and not too large
22:17 🔗 SketchCow It's an art
22:17 🔗 astrid there are many right answers and many wrong answers
22:17 🔗 oxguy3 problem with that is there are two top-level folders which surpass 50GB lol
22:17 🔗 oxguy3 i'll figure something out
22:18 🔗 astrid the main constraint on archive.org items is that each item has to live on a hard drive with all of its files together
22:18 🔗 astrid so if you exceed commercially available hard drive sizes it won't be able to fit anywhere
22:18 🔗 astrid and if you get close to it then it makes their end of things ... more complicated
22:19 🔗 oxguy3 yeah, i think i'm gonna make one item for a 72GB folder, one item for a 65GB folder, and one item for everything else (which totals 73GB)
22:19 🔗 astrid oh those can live in one item together i'd say
22:20 🔗 astrid three largeish zip files in a single item is a solid choice here
22:20 🔗 oxguy3 hmm, i thought the rule was no files over 50gb?
22:21 🔗 oxguy3 i was planning on splitting the two mega folders into zips for each sub item
22:21 🔗 astrid imo better to keep them together so they don't get lost
22:22 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
22:22 🔗 oxguy3 i'll keep them in the same item, but divide them into multiple zips
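A sketch of astrid's one-zip-per-top-level-folder idea, with the oversized folders split further by subfolder (directory names are placeholders; zip stores file modification times, so the timestamps wget applied survive):

    cd vikings-mirror          # hypothetical directory holding the raw ftp mirror
    mkdir -p ../zips
    for dir in */ ; do
        zip -r "../zips/${dir%/}.zip" "$dir"
    done
    # for a folder well over 50 GB, one zip per subfolder instead:
    # (cd bigfolder && for sub in */ ; do zip -r "../../zips/bigfolder-${sub%/}.zip" "$sub"; done)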
22:24 🔗 jamiew_ has joined #archiveteam-ot
22:27 🔗 jamiew_ has quit IRC (Client Quit)
22:28 🔗 jamiew__ has quit IRC (Read error: Operation timed out)
22:35 🔗 oxguy3 alright they're slowly getting zipped -- my home server has a wimpy CPU so it's gonna be a while. ay caramba, what a messy project
22:35 🔗 markedL the ftp site has same credentials as web site?
22:36 🔗 oxguy3 yep!
22:36 🔗 oxguy3 the website seems to just be an FTP client
22:45 🔗 markedL it pretends to accept anonymous but dunno what email addresses it will approve
22:48 🔗 oxguy3 i'll dm you the login (im not too concerned about sharing it now that i have a complete mirror)
23:05 🔗 markedL what tools does ftp into warc ?
23:08 🔗 JAA wpull can do that.
23:08 🔗 JAA I don't think there's any standard on how to save FTP to WARC though.
23:09 🔗 oxguy3 i did it with wget
23:10 🔗 oxguy3 wget --user="vPR-Read" --password="removed" ftp://vikings.flashspot.tv/ --mirror --warc-file=vikings --warc-max-size=1G --warc-header="ftp-user: vPR-Read"
23:12 🔗 markedL since the credentials are on the open web, I feel ethics are different but I'm not going to use them since it's done already
23:12 🔗 markedL maybe someone else would prefer a web in warc copy
23:13 🔗 oxguy3 i have the full warc mirror fyi
23:13 🔗 Frogging Is there really much point to capturing FTP to WARC, though?
23:14 🔗 Frogging The files are all independent of each other and the headers are irrelevant
23:14 🔗 oxguy3 ¯\_(ツ)_/¯ that's how the archiveteam ftp project does it so i just copied them
23:15 🔗 JAA I've been asking myself that as well. What it's nice for is keeping the retrieval commands tightly coupled to the data.
23:15 🔗 Frogging Just make sure to preserve the timestamps. zip or tar will do that for you
23:16 🔗 markedL you get hashes
23:16 🔗 Frogging (assuming you downloaded with something that preserves timestamps)
23:16 🔗 Frogging FTP has hashes?
23:16 🔗 JAA WARC does.
23:16 🔗 Frogging ah
23:16 🔗 Frogging find . -type f -exec md5sum {} + > md5sum.txt
23:16 🔗 oxguy3 wget created .listing files in every directory which included timestamps, so i figure that's probably good enough
23:16 🔗 JAA FTP per standard doesn't, but there are extensions.
23:18 🔗 Frogging wget should have applied the timestamps to the downloaded files
23:18 🔗 Frogging which will be preserved if you zip/tar them
23:19 🔗 markedL there needs to be more .warc support; it's a little repetitive defending the archival properties of something that's not friendly to use
23:19 🔗 oxguy3 yeah i believe it did
23:57 🔗 martini has joined #archiveteam-ot
