#archiveteam-ot 2019-12-19,Thu


Time Nickname Message
00:07 🔗 girst has quit IRC (Remote host closed the connection)
01:07 🔗 bithippo has quit IRC (Textual IRC Client: www.textualapp.com)
01:41 🔗 synm0nger has quit IRC (Quit: Wait, what?)
01:57 🔗 SynMonger has joined #archiveteam-ot
02:11 🔗 DogsRNice has quit IRC (Read error: Connection reset by peer)
02:12 🔗 cerca has joined #archiveteam-ot
03:19 🔗 icedice has joined #archiveteam-ot
03:21 🔗 jamiew has joined #archiveteam-ot
03:30 🔗 SoraUta has joined #archiveteam-ot
03:37 🔗 jamiew has quit IRC (Textual IRC Client: www.textualapp.com)
03:42 🔗 cerca has quit IRC (Remote host closed the connection)
04:01 🔗 icedice has quit IRC (Read error: Operation timed out)
04:51 🔗 qw3rty has joined #archiveteam-ot
05:00 🔗 qw3rty2 has quit IRC (Ping timeout: 745 seconds)
05:14 🔗 markedL has joined #archiveteam-ot
05:15 🔗 markedL do people burn in new hard drives before putting data on them?
05:17 🔗 kpcyrd I don't, now I'm wondering if I should
05:27 🔗 Frogging I run badblocks
05:28 🔗 Frogging It writes to every byte on the device and then reads the whole device
05:29 🔗 godane has quit IRC (Ping timeout: 745 seconds)
05:31 🔗 Frogging ("writing to every byte" is an oversimplification because of sectors and all that, but you get the idea)
05:32 🔗 Frogging I don't know if there's any point to doing so, but I don't see any reason not to. If the drive fails after a few passes of that, you've saved yourself some trouble later on.
05:33 🔗 Frogging it'll also find any bad sectors; that's actually what badblocks is designed to do
05:34 🔗 Frogging I'm sure they do that at the factory already, but still, why not do it again? Consider it a pre-formatting scrub
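A destructive badblocks pass along the lines Frogging describes would look roughly like this (a minimal sketch; /dev/sdX is a placeholder for the new, empty drive, and the -w write-mode test erases everything on it):

    # write four test patterns across the whole device, reading each back to verify
    sudo badblocks -wsv -b 4096 /dev/sdX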
05:40 🔗 godane has joined #archiveteam-ot
05:51 🔗 tuluu has quit IRC (Remote host closed the connection)
05:52 🔗 tuluu has joined #archiveteam-ot
06:26 🔗 Flashfire has quit IRC (Read error: Connection reset by peer)
06:27 🔗 Flashfire has joined #archiveteam-ot
07:08 🔗 deevious has joined #archiveteam-ot
07:12 🔗 dhyan_nat has joined #archiveteam-ot
08:11 🔗 deevious has quit IRC (Quit: deevious)
08:15 🔗 deevious has joined #archiveteam-ot
08:34 🔗 ShellyRol has quit IRC (Read error: Connection reset by peer)
08:37 🔗 bluefoo has quit IRC (Read error: Operation timed out)
08:51 🔗 ShellyRol has joined #archiveteam-ot
09:04 🔗 SoraUta has quit IRC (Ping timeout: 610 seconds)
09:39 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
09:54 🔗 Laverne has quit IRC (Ping timeout: 258 seconds)
09:54 🔗 mls has quit IRC (Ping timeout: 258 seconds)
09:54 🔗 VoynichCr has quit IRC (Ping timeout: 258 seconds)
09:54 🔗 sHATNER has quit IRC (Ping timeout: 258 seconds)
09:55 🔗 eythian has quit IRC (Ping timeout: 258 seconds)
09:55 🔗 luckcolor has quit IRC (Ping timeout: 258 seconds)
09:55 🔗 luckcolor has joined #archiveteam-ot
09:57 🔗 eythian has joined #archiveteam-ot
09:59 🔗 mls has joined #archiveteam-ot
10:00 🔗 sHATNER has joined #archiveteam-ot
10:26 🔗 deevious has quit IRC (Quit: deevious)
10:31 🔗 BlueMaxim has joined #archiveteam-ot
10:43 🔗 BlueMax has quit IRC (Ping timeout: 745 seconds)
10:49 🔗 deevious has joined #archiveteam-ot
10:59 🔗 VoynichCr has joined #archiveteam-ot
11:00 🔗 Laverne has joined #archiveteam-ot
12:23 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
12:47 🔗 bluefoo has joined #archiveteam-ot
13:20 🔗 jamiew has joined #archiveteam-ot
13:21 🔗 SoraUta has joined #archiveteam-ot
13:24 🔗 jamiew has quit IRC (Client Quit)
13:43 🔗 JAA Yeah, I also do essentially that (with SMART long test if available), and I also use fio to stress-test the mechanics for a few hours.
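A rough sketch of that combination, assuming smartmontools and fio are installed and /dev/sdX is the new drive (the fio parameters here are illustrative, not JAA's actual settings):

    sudo smartctl -t long /dev/sdX     # start the drive's built-in long self-test
    sudo smartctl -a /dev/sdX          # check the result and SMART attributes afterwards
    # a few hours of random reads to exercise the mechanics
    sudo fio --name=burnin --filename=/dev/sdX --rw=randread --bs=4k \
        --direct=1 --time_based --runtime=14400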
13:55 🔗 tuluu has quit IRC (Read error: Connection refused)
13:55 🔗 tuluu has joined #archiveteam-ot
14:45 🔗 bluefoo has quit IRC (Ping timeout: 744 seconds)
14:50 🔗 bluefoo has joined #archiveteam-ot
15:07 🔗 girst has joined #archiveteam-ot
15:15 🔗 deevious has quit IRC (Quit: deevious)
15:38 🔗 mc2 has quit IRC (Read error: Operation timed out)
15:58 🔗 dhyan_nat has joined #archiveteam-ot
16:41 🔗 jamiew has joined #archiveteam-ot
17:01 🔗 jamiew has quit IRC (Textual IRC Client: www.textualapp.com)
17:34 🔗 jamiew has joined #archiveteam-ot
17:56 🔗 ivan markedL: I do a write-read test using my http://github.com/ludios/drive-checker which has never actually caught a problem on a new drive for me (they tend to be tested before they get shipped out)
17:56 🔗 ivan but it does catch memory problems on the computer I use it on :-)
17:58 🔗 markedL I guess I'm equally concerned about damage from shipping or improper storage. The drives likely worked when they left the factory.
18:00 🔗 ivan that will show up as 'very DOA'
18:01 🔗 ivan noises, fail to spin up
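For comparison, a bare-bones write-then-read check can be put together from standard tools (an untested sketch, not ivan's drive-checker; /dev/sdX is a placeholder, the test destroys the drive's contents, and the >(...) syntax needs bash):

    SIZE=$(sudo blockdev --getsize64 /dev/sdX)
    # write a reproducible pseudorandom stream over the whole device, hashing it on the way in
    openssl enc -aes-128-ctr -pass pass:burnin -nosalt </dev/zero 2>/dev/null \
        | head -c "$SIZE" | tee >(sha256sum >expected.sha256) \
        | sudo dd of=/dev/sdX bs=4M iflag=fullblock conv=fsync status=progress
    # read the device back and compare; a mismatch points at the drive, RAM, or cabling
    sudo dd if=/dev/sdX bs=4M status=progress | sha256sum >actual.sha256
    diff expected.sha256 actual.sha256 && echo "read back matches what was written"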
18:09 🔗 qw3rty has quit IRC (Ping timeout: 745 seconds)
18:13 🔗 qw3rty has joined #archiveteam-ot
18:20 🔗 VerifiedJ has joined #archiveteam-ot
18:29 🔗 X-Scale` has joined #archiveteam-ot
18:29 🔗 LowLevelM has quit IRC (Read error: Operation timed out)
18:30 🔗 LowLevelM has joined #archiveteam-ot
18:34 🔗 X-Scale has quit IRC (Ping timeout: 610 seconds)
18:34 🔗 X-Scale` is now known as X-Scale
20:19 🔗 jamiew_ has joined #archiveteam-ot
20:20 🔗 MilkGames has joined #archiveteam-ot
20:20 🔗 jamiew_ has quit IRC (Client Quit)
20:21 🔗 MilkGames Hey there, how would I go about getting a web archive moved to the ArchiveTeam collection?
20:33 🔗 MilkGames Sorry, just realised this is the wrong channel to ask that in. I'll ask in another.
20:33 🔗 MilkGames has left
20:43 🔗 DogsRNice has joined #archiveteam-ot
21:06 🔗 jamiew_ has joined #archiveteam-ot
21:10 🔗 oxguy3 has joined #archiveteam-ot
21:16 🔗 oxguy3 uh hey, so i've got 204GB of gzipped WARC files from an FTP site... is there anything i should know before i attempt to upload this to archive.org with the ia command line tool? i've never uploaded anything remotely this big
21:25 🔗 astrid you might want to split it into a handful of distinct ia items, depending. up to you though.
21:25 🔗 astrid how big is each warc?
21:27 🔗 oxguy3 i set wget to target 1GB file size, but most are a bit bigger, and some are huge -- got a 12gb and an 11gb
21:27 🔗 astrid oh that's reasonable
21:28 🔗 astrid :)
21:29 🔗 oxguy3 yeah it's not too bad, and i figure it'd be best to keep them together so it's a singular package (i archived the entire FTP server lol)
21:29 🔗 oxguy3 i guess i'll get to uploading... this is gonna take a while lol
21:31 🔗 Kaz split to about 50gb/item ideally, I think is the common recommendation
21:33 🔗 oxguy3 i thought it was 50 gb/file and 1000 files/item? https://help.archive.org/hc/en-us/articles/360016475032-Uploading-Tips
21:37 🔗 oxguy3 oh wait, looks like items aren't supposed to be bigger than 100gb, shoot https://archive.org/services/docs/api/items.html#item-limitations
21:37 🔗 oxguy3 alright if i'm uploading this into multiple items, does it matter which item i put the meta warc file in?
21:40 🔗 Kaz yeah, don't make a 50tb item :)
21:40 🔗 Kaz I'm pretty sure it wouldn't be possible anyway, as I think all files for an item live together on a disk
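With the ia command-line tool, a split along those lines looks roughly like this (the identifiers, titles, and file name pattern are made up for illustration; wget's --warc-max-size produces numbered vikings-*.warc.gz pieces):

    # one item per chunk well under the ~100 GB guideline, each holding a batch of warc.gz files
    ia upload vikings-flashspot-ftp-part1 vikings-000*.warc.gz \
        --metadata="mediatype:web" \
        --metadata="title:vikings.flashspot.tv FTP mirror (part 1)"
    ia upload vikings-flashspot-ftp-part2 vikings-001*.warc.gz \
        --metadata="mediatype:web" \
        --metadata="title:vikings.flashspot.tv FTP mirror (part 2)"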
21:54 🔗 oxguy3 i'm assuming i should set mediatype:web since this is WARC files, even though it's ftp rather than http, right?
22:03 🔗 jamiew__ has joined #archiveteam-ot
22:08 🔗 SketchCow Also, unless you're an approved archive team project, it won't go into wayback
22:09 🔗 oxguy3 yeah i figured, but will it still be browseable on archive.org?
22:09 🔗 SketchCow The item? Sure
22:09 🔗 jamiew_ has quit IRC (Read error: Operation timed out)
22:09 🔗 oxguy3 like as in, would you be able to browse the full contents of the server in some easy way, instead of having to dig through warc.gz files?
22:09 🔗 SketchCow Nope
22:10 🔗 oxguy3 ah, hmm. would it be better if i just uploaded the actual raw files instead of the WARCs? (i didnt include --delete-after in my wget command so i have them raw as well)
22:10 🔗 SketchCow I don't know how more or not more better it is.
22:11 🔗 SketchCow What FTP site is it
22:11 🔗 SketchCow Dare you to say Intel
22:11 🔗 oxguy3 vikings.flashspot.tv -- the Minnesota Vikings used it to share video and photos with the press for many years
22:11 🔗 SketchCow Is it still up
22:12 🔗 oxguy3 yes, but hasn't been updated since 2017
22:12 🔗 SketchCow Just pass this to archivebot to do it
22:12 🔗 SketchCow Then it's all handled and it goes in wayback
22:13 🔗 oxguy3 it requires a login, i wasn't sure if that would be an issue
22:14 🔗 SketchCow Upload the raw files.
22:14 🔗 SketchCow How many files is it.
22:14 🔗 oxguy3 okay cool
22:14 🔗 oxguy3 uhhh, a lot... let me see
22:14 🔗 SketchCow Either way, it's going to be a nightmare
22:14 🔗 oxguy3 33462
22:15 🔗 SketchCow Yeah, raw files, have a ball
22:15 🔗 SketchCow Easiest if you upload it as a large set of .ZIP files
22:15 🔗 oxguy3 ah yeah, that sounds better than making 34+ items lol
22:16 🔗 SketchCow Let's put it this way, it's going to be awful no matter what.
22:16 🔗 SketchCow A few largish .zip files will do
22:16 🔗 astrid a .zip per toplevel folder maybe
22:17 🔗 SketchCow Yeah, not too many, and not too large
22:17 🔗 SketchCow It's an art
22:17 🔗 astrid there are many right answers and many wrong answers
22:17 🔗 oxguy3 problem with that is there are two top-level folders which surpass 50GB lol
22:17 🔗 oxguy3 i'll figure something out
22:18 🔗 astrid the main constraint on archive.org items is that each item has to live on a hard drive with all of its files together
22:18 🔗 astrid so if you exceed commercially available hard drive sizes it won't be able to fit anywhere
22:18 🔗 astrid and if you get close to it then it makes their end of things ... more complicated
22:19 🔗 oxguy3 yeah, i think i'm gonna make one item for a 72GB folder, one item for a 65GB folder, and one item for everything else (which totals 73GB)
22:19 🔗 astrid oh those can live in one item together i'd say
22:20 🔗 astrid three largeish zip files in a single item is a solid choice here
22:20 🔗 oxguy3 hmm, i thought the rule was no files over 50gb?
22:21 🔗 oxguy3 i was planning on splitting the two mega folders into zips for each sub item
22:21 🔗 astrid imo better to keep them together so they don't get lost
22:22 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
22:22 🔗 oxguy3 i'll keep them in the same item, but divide them into multiple zips
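A sketch of astrid's one-zip-per-top-level-folder idea, with the oversized folders split further by subfolder (directory names are placeholders; zip stores file modification times, so the timestamps wget applied survive):

    cd vikings-mirror          # hypothetical directory holding the raw ftp mirror
    mkdir -p ../zips
    for dir in */ ; do
        zip -r "../zips/${dir%/}.zip" "$dir"
    done
    # for a folder well over 50 GB, one zip per subfolder instead:
    # (cd bigfolder && for sub in */ ; do zip -r "../../zips/bigfolder-${sub%/}.zip" "$sub"; done)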
22:24 🔗 jamiew_ has joined #archiveteam-ot
22:27 🔗 jamiew_ has quit IRC (Client Quit)
22:28 🔗 jamiew__ has quit IRC (Read error: Operation timed out)
22:35 🔗 oxguy3 alright they're slowly getting zipped -- my home server has a wimpy CPU so it's gonna be a while. ay caramba, what a messy project
22:35 🔗 markedL the ftp site has same credentials as web site?
22:36 🔗 oxguy3 yep!
22:36 🔗 oxguy3 the website seems to just be an FTP client
22:45 🔗 markedL it pretends to accept anonymous but dunno what email addresses it will approve
22:48 🔗 oxguy3 i'll dm you the login (im not too concerned about sharing it now that i have a complete mirror)
23:05 🔗 markedL what tools does ftp into warc ?
23:08 🔗 JAA wpull can do that.
23:08 🔗 JAA I don't think there's any standard on how to save FTP to WARC though.
23:09 🔗 oxguy3 i did it with wget
23:10 🔗 oxguy3 wget --user="vPR-Read" --password="removed" ftp://vikings.flashspot.tv/ --mirror --warc-file=vikings --warc-max-size=1G --warc-header="ftp-user: vPR-Read"
23:12 🔗 markedL since the credentials are on the open web, I feel ethics are different but I'm not going to use them since it's done already
23:12 🔗 markedL maybe someone else would prefer a web in warc copy
23:13 🔗 oxguy3 i have the full warc mirror fyi
23:13 🔗 Frogging Is there really much point to capturing FTP to WARC, though?
23:14 🔗 Frogging The files are all independent of each other and the headers are irrelevant
23:14 🔗 oxguy3 ¯\_(ツ)_/¯ that's how the archiveteam ftp project does it so i just copied them
23:15 🔗 JAA I've been asking myself that as well. What it's nice for is keeping the retrieval commands tightly coupled to the data.
23:15 🔗 Frogging Just make sure to preserve the timestamps. zip or tar will do that for you
23:16 🔗 markedL you get hashes
23:16 🔗 Frogging (assuming you downloaded with something that preserves timestamps)
23:16 🔗 Frogging FTP has hashes?
23:16 🔗 JAA WARC does.
23:16 🔗 Frogging ah
23:16 🔗 Frogging find . -type f -exec md5sum {} + > md5sum.txt
23:16 🔗 oxguy3 wget created .listing files in every directory which included timestamps, so i figure that's probably good enough
23:16 🔗 JAA FTP per standard doesn't, but there are extensions.
23:18 🔗 Frogging wget should have applied the timestamps to the downloaded files
23:18 🔗 Frogging which will be preserved if you zip/tar them
23:19 🔗 markedL there needs to be more .warc support; it's a little repetitive defending the archival properties of something that's not friendly to use
23:19 🔗 oxguy3 yeah i believe it did
23:57 🔗 martini has joined #archiveteam-ot
