[00:07] *** girst has quit IRC (Remote host closed the connection)
[01:07] *** bithippo has quit IRC (Textual IRC Client: www.textualapp.com)
[01:41] *** synm0nger has quit IRC (Quit: Wait, what?)
[01:57] *** SynMonger has joined #archiveteam-ot
[02:11] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[02:12] *** cerca has joined #archiveteam-ot
[03:19] *** icedice has joined #archiveteam-ot
[03:21] *** jamiew has joined #archiveteam-ot
[03:30] *** SoraUta has joined #archiveteam-ot
[03:37] *** jamiew has quit IRC (Textual IRC Client: www.textualapp.com)
[03:42] *** cerca has quit IRC (Remote host closed the connection)
[04:01] *** icedice has quit IRC (Read error: Operation timed out)
[04:51] *** qw3rty has joined #archiveteam-ot
[05:00] *** qw3rty2 has quit IRC (Ping timeout: 745 seconds)
[05:14] *** markedL has joined #archiveteam-ot
[05:15] do people burn in new hard drives before putting data on them?
[05:17] I don't, now I'm wondering if I should
[05:27] I run badblocks
[05:28] It writes to every byte on the device and then reads the whole device
[05:29] *** godane has quit IRC (Ping timeout: 745 seconds)
[05:31] ("writing to every byte" is an oversimplification because of sectors and all that, but you get the idea)
[05:32] I don't know if there's any point to doing so, but I don't see any reason not to. If the drive fails after a few passes of that, you've saved yourself some trouble later on.
[05:33] it'll also find any bad sectors; that's actually what badblocks is designed to do
[05:34] I'm sure they do that at the factory already, but still, why not do it again? Consider it a pre-formatting scrub
[05:40] *** godane has joined #archiveteam-ot
[05:51] *** tuluu has quit IRC (Remote host closed the connection)
[05:52] *** tuluu has joined #archiveteam-ot
[06:26] *** Flashfire has quit IRC (Read error: Connection reset by peer)
[06:27] *** Flashfire has joined #archiveteam-ot
[07:08] *** deevious has joined #archiveteam-ot
[07:12] *** dhyan_nat has joined #archiveteam-ot
[08:11] *** deevious has quit IRC (Quit: deevious)
[08:15] *** deevious has joined #archiveteam-ot
[08:34] *** ShellyRol has quit IRC (Read error: Connection reset by peer)
[08:37] *** bluefoo has quit IRC (Read error: Operation timed out)
[08:51] *** ShellyRol has joined #archiveteam-ot
[09:04] *** SoraUta has quit IRC (Ping timeout: 610 seconds)
[09:39] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[09:54] *** Laverne has quit IRC (Ping timeout: 258 seconds)
[09:54] *** mls has quit IRC (Ping timeout: 258 seconds)
[09:54] *** VoynichCr has quit IRC (Ping timeout: 258 seconds)
[09:54] *** sHATNER has quit IRC (Ping timeout: 258 seconds)
[09:55] *** eythian has quit IRC (Ping timeout: 258 seconds)
[09:55] *** luckcolor has quit IRC (Ping timeout: 258 seconds)
[09:55] *** luckcolor has joined #archiveteam-ot
[09:57] *** eythian has joined #archiveteam-ot
[09:59] *** mls has joined #archiveteam-ot
[10:00] *** sHATNER has joined #archiveteam-ot
[10:26] *** deevious has quit IRC (Quit: deevious)
[10:31] *** BlueMaxim has joined #archiveteam-ot
[10:43] *** BlueMax has quit IRC (Ping timeout: 745 seconds)
[10:49] *** deevious has joined #archiveteam-ot
[10:59] *** VoynichCr has joined #archiveteam-ot
[11:00] *** Laverne has joined #archiveteam-ot
[12:23] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[12:47] *** bluefoo has joined #archiveteam-ot
[13:20] *** jamiew has joined #archiveteam-ot
[13:21] *** SoraUta has joined #archiveteam-ot
[13:24] *** jamiew has quit IRC (Client Quit)
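For reference, a minimal sketch of the badblocks burn-in described above (05:27-05:34). /dev/sdX is a placeholder for the new drive, and the -w write-mode test overwrites everything on it, so it is only for a disk that holds no data yet:

    # destructive four-pattern pass: write every block, then read it back
    badblocks -wsv -b 4096 /dev/sdX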
[13:43] Yeah, I also do essentially that (with SMART long test if available), and I also use fio to stress-test the mechanics for a few hours.
[13:55] *** tuluu has quit IRC (Read error: Connection refused)
[13:55] *** tuluu has joined #archiveteam-ot
[14:45] *** bluefoo has quit IRC (Ping timeout: 744 seconds)
[14:50] *** bluefoo has joined #archiveteam-ot
[15:07] *** girst has joined #archiveteam-ot
[15:15] *** deevious has quit IRC (Quit: deevious)
[15:38] *** mc2 has quit IRC (Read error: Operation timed out)
[15:58] *** dhyan_nat has joined #archiveteam-ot
[16:41] *** jamiew has joined #archiveteam-ot
[17:01] *** jamiew has quit IRC (Textual IRC Client: www.textualapp.com)
[17:34] *** jamiew has joined #archiveteam-ot
[17:56] markedL: I do a write-read test using my http://github.com/ludios/drive-checker which has never actually caught a problem on a new drive for me (they tend to be tested before they get shipped out)
[17:56] but it does catch memory problems on the computer I use it on :-)
[17:58] I guess I'm equally concerned about damage from shipping or improper storage. they likely worked when it left the factory.
[18:00] that will show up as 'very DOA'
[18:01] noises, fail to spin up
[18:09] *** qw3rty has quit IRC (Ping timeout: 745 seconds)
[18:13] *** qw3rty has joined #archiveteam-ot
[18:20] *** VerifiedJ has joined #archiveteam-ot
[18:29] *** X-Scale` has joined #archiveteam-ot
[18:29] *** LowLevelM has quit IRC (Read error: Operation timed out)
[18:30] *** LowLevelM has joined #archiveteam-ot
[18:34] *** X-Scale has quit IRC (Ping timeout: 610 seconds)
[18:34] *** X-Scale` is now known as X-Scale
[20:19] *** jamiew_ has joined #archiveteam-ot
[20:20] *** MilkGames has joined #archiveteam-ot
[20:20] *** jamiew_ has quit IRC (Client Quit)
[20:21] Hey there, how would I go about getting a web archive moved to the ArchiveTeam collection?
[20:33] Sorry, just realised this is the wrong channel to ask that in. I'll ask in another.
[20:33] *** MilkGames has left
[20:43] *** DogsRNice has joined #archiveteam-ot
[21:06] *** jamiew_ has joined #archiveteam-ot
[21:10] *** oxguy3 has joined #archiveteam-ot
[21:16] uh hey, so i've got 204GB of gzipped WARC files from an FTP site... is there anything i should know before i attempt to upload this to archive.org with the ia command line tool? i've never uploaded anything remotely this big
[21:25] you might want to split it into a handful of distinct ia items, depending. up to you though.
[21:25] how big is each warc?
[21:27] i set wget to target 1GB file size, but most are a bit bigger, and some are huge -- got a 12gb and an 11gb
[21:27] oh that's reasonable
[21:28] :)
[21:29] yeah it's not too bad, and i figure it'd be best to keep them together so it's a singular package (i archived the entire FTP server lol)
[21:29] i guess i'll get to uploading... this is gonna take a while lol
[21:31] split to about 50gb/item ideally, I think is the common recommendation
[21:33] i thought it was 50 gb/file and 1000 files/item? https://help.archive.org/hc/en-us/articles/360016475032-Uploading-Tips
[21:37] oh wait, looks like items aren't supposed to be bigger than 100gb, shoot https://archive.org/services/docs/api/items.html#item-limitations
[21:37] alright if i'm uploading this into multiple items, does it matter which item i put the meta warc file in?
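Picking up the burn-in thread from 13:43: a sketch of a SMART extended self-test plus an fio stress run. The device name and job parameters are illustrative assumptions, and the random-write job is destructive, so it belongs before the drive holds data:

    # start the drive's extended self-test, then read the results back later
    smartctl -t long /dev/sdX
    smartctl -a /dev/sdX
    # ~4 hours of mixed random I/O to exercise the mechanics (destructive)
    fio --name=burnin --filename=/dev/sdX --direct=1 --rw=randrw --bs=4k \
        --ioengine=libaio --iodepth=16 --runtime=14400 --time_based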
[21:40] yeah, don't make a 50tb item :)
[21:40] I'm pretty sure it wouldn't be possible anyway, as I think all files for an item live together on a disk
[21:54] i'm assuming i should set mediatype:web since this is WARC files, even though it's ftp rather than http, right?
[22:03] *** jamiew__ has joined #archiveteam-ot
[22:08] Also, unless you're an approved archive team project, it won't go into wayback
[22:09] yeah i figured, but will it still be browseable on archive.org?
[22:09] The item? Sure
[22:09] *** jamiew_ has quit IRC (Read error: Operation timed out)
[22:09] like as in, would you be able to browse the full contents of the server in some easy way, instead of having to dig through warc.gz files?
[22:09] Nope
[22:10] ah, hmm. would it be better if i just uploaded the actual raw files instead of the WARCs? (i didn't include --delete-after in my wget command so i have them raw as well)
[22:10] I don't know how more or not more better it is.
[22:11] What FTP site is it
[22:11] Dare you to say Intel
[22:11] vikings.flashspot.tv -- the Minnesota Vikings used it to share video and photos with the press for many years
[22:11] Is it still up
[22:12] yes, but hasn't been updated since 2017
[22:12] Just pass this to archivebot to do it
[22:12] Then it's all handled and it goes in wayback
[22:13] it requires a login, i wasn't sure if that would be an issue
[22:14] Upload the raw files.
[22:14] How many files is it.
[22:14] okay cool
[22:14] uhhh, a lot... let me see
[22:14] Either way, it's going to be a nightmare
[22:14] 33462
[22:15] Yeah, raw files, have a ball
[22:15] Easiest if you upload it as a large set of .ZIP files
[22:15] ah yeah, that sounds better than making 34+ items lol
[22:16] Let's put it this way, it's going to be awful no matter what.
[22:16] A few largish .zip files will do
[22:16] a .zip per top-level folder maybe
[22:17] Yeah, not too many, and not too large
[22:17] It's an art
[22:17] there are many right answers and many wrong answers
[22:17] problem with that is there are two top-level folders which surpass 50GB lol
[22:17] i'll figure something out
[22:18] the main constraint on archive.org items is that each item has to live on a hard drive with all of its files together
[22:18] so if you exceed commercially shipping hard drives it won't be able to fit anywhere
[22:18] and if you get close to it then it makes their end of things ... more complicated
[22:19] yeah, i think i'm gonna make one item for a 72GB folder, one item for a 65GB folder, and one item for everything else (which totals 73GB)
[22:19] oh those can live in one item together i'd say
[22:20] three largish zip files in a single item is a solid choice here
[22:20] hmm, i thought the rule was no files over 50gb?
[22:21] i was planning on splitting the two mega folders into zips for each sub item
[22:21] imo better to keep them together so they don't get lost
[22:22] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[22:22] i'll keep them in the same item, but divide them into multiple zips
[22:24] *** jamiew_ has joined #archiveteam-ot
[22:27] *** jamiew_ has quit IRC (Client Quit)
[22:28] *** jamiew__ has quit IRC (Read error: Operation timed out)
[22:35] alright they're slowly getting zipped -- my home server has a wimpy CPU so it's gonna be a while. aye carumba, what a messy project
[22:35] the ftp site has same credentials as web site?
[22:36] yep!
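A rough sketch of the zip-per-top-level-folder plan discussed above, uploaded with the ia tool mentioned at 21:16. The item identifier and metadata values are invented for illustration, not a confirmed choice:

    # one zip per top-level folder; zip keeps each file's modification time
    for d in */ ; do zip -r "${d%/}.zip" "$d" ; done
    # everything into a single item, per the advice at 22:19-22:21
    ia upload vikings-flashspot-ftp-mirror *.zip \
        --metadata="mediatype:data" \
        --metadata="title:vikings.flashspot.tv FTP mirror"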
[22:36] the website seems to just be an FTP client
[22:45] it pretends to accept anonymous but dunno what email addresses it will approve
[22:48] i'll dm you the login (i'm not too concerned about sharing it now that i have a complete mirror)
[23:05] what tools does ftp into warc?
[23:08] wpull can do that.
[23:08] I don't think there's any standard on how to save FTP to WARC though.
[23:09] i did it with wget
[23:10] wget --user="vPR-Read" --password="removed" ftp://vikings.flashspot.tv/ --mirror --warc-file=vikings --warc-max-size=1G --warc-header="ftp-user: vPR-Read"
[23:12] since the credentials are on the open web, I feel ethics are different but I'm not going to use them since it's done already
[23:12] maybe someone else would prefer a web in warc copy
[23:13] i have the full warc mirror fyi
[23:13] Is there really much point to capturing FTP to WARC, though?
[23:14] The files are all independent of each other and the headers are irrelevant
[23:14] ¯\_(ツ)_/¯ that's how the archiveteam ftp project does it so i just copied them
[23:15] I've been asking myself that as well. What it's nice for is keeping the retrieval commands tightly coupled to the data.
[23:15] Just make sure to preserve the timestamps. zip or tar will do that for you
[23:16] you get hashes
[23:16] (assuming you downloaded with something that preserves timestamps)
[23:16] FTP has hashes?
[23:16] WARC does.
[23:16] ah
[23:16] find . -type f -exec md5sum {} + > md5sum.txt
[23:16] wget created .listing files in every directory which included timestamps, so i figure that's probably good enough
[23:16] FTP per standard doesn't, but there are extensions.
[23:18] wget should have applied the timestamps to the downloaded files
[23:18] which will be preserved if you zip/tar them
[23:19] there needs to be more .warc support, it's a little repetitive defending archival properties for something that's not friendly to use
[23:19] yeah i believe it did
[23:57] *** martini has joined #archiveteam-ot
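To make the hashes and timestamps points above concrete: wget's WARC writer records a digest header per payload, and the mirrored tree keeps the FTP mtimes that zip/tar will preserve. The WARC filename and directory below are guesses based on the command at 23:10; a quick sanity check might look like:

    # peek at the per-record digests wget wrote into one of the WARCs
    zcat vikings-00000.warc.gz | grep -a "^WARC-Payload-Digest" | head
    # spot-check that the mirrored files kept their modification times
    ls -l --time-style=long-iso vikings.flashspot.tv/ | head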