#archiveteam-bs 2017-03-12,Sun


Time Nickname Message
00:06 🔗 icedice JensRex: Ever considered using Scaleway? Cheaper and you can also get dedicated baremetal VPS servers from them
00:10 🔗 JensRex Nah, I like DO better.
00:15 🔗 Frogging dedicated baremetal VPS?
00:16 🔗 Frogging how does that work?
00:16 🔗 tobbez That sounds contradictory
00:19 🔗 Frogging there are single-tenant VPS but they're not baremetal. and the other kind of VPS wouldn't be dedicated either
00:21 🔗 username1 is now known as schbirid
00:22 🔗 schbirid is ftp.cs.princeton.edu known and archived already? i grabbed it the other day when it was posted on opendirectories
00:22 🔗 schbirid ~90G
00:51 🔗 rocode He means their actual baremetal offerings
00:51 🔗 rocode People have gotten used to calling everything VPSs
00:52 🔗 odemg has quit IRC (Remote host closed the connection)
00:55 🔗 schbirid has quit IRC (Quit: Leaving)
01:02 🔗 alfie has quit IRC (Ping timeout: 260 seconds)
01:05 🔗 alfie has joined #archiveteam-bs
02:21 🔗 j08nY has quit IRC (Quit: Leaving)
02:57 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
03:13 🔗 kristian_ has quit IRC (Quit: Leaving)
04:10 🔗 DFJustin has quit IRC (Remote host closed the connection)
04:13 🔗 DFJustin has joined #archiveteam-bs
04:26 🔗 pnJay has quit IRC (Read error: Connection reset by peer)
04:27 🔗 pnJay has joined #archiveteam-bs
04:33 🔗 beardicus has quit IRC (Read error: Operation timed out)
04:50 🔗 beardicus has joined #archiveteam-bs
04:54 🔗 ndiddy has quit IRC ()
04:58 🔗 icedice has quit IRC (Quit: Leaving)
05:10 🔗 beardicus has quit IRC (Read error: Operation timed out)
05:12 🔗 godane i found 3 more episodes of call for help canada
05:12 🔗 godane 2 from 2004 and one from 2005
05:27 🔗 kristian_ has joined #archiveteam-bs
05:28 🔗 beardicus has joined #archiveteam-bs
05:48 🔗 kristian_ has quit IRC (Quit: Leaving)
05:49 🔗 SketchCow godane: I had someone look at the japanese scans collection
05:49 🔗 SketchCow His assessment is they're really, really super bad
05:49 🔗 SketchCow But I was glad to be tipped off
05:57 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
06:00 🔗 Aranje has quit IRC (Ping timeout: 506 seconds)
06:03 🔗 Sk1d has joined #archiveteam-bs
06:04 🔗 Honno_ has joined #archiveteam-bs
06:23 🔗 godane i sort of figure they are bad scans
06:23 🔗 godane but i only seen a few images of it
06:24 🔗 godane and it didn't look bad from those images
06:30 🔗 Stilett0- has joined #archiveteam-bs
08:08 🔗 GE has joined #archiveteam-bs
08:55 🔗 dashcloud has quit IRC (Read error: Operation timed out)
08:57 🔗 midas has quit IRC (Read error: Operation timed out)
08:57 🔗 midas has joined #archiveteam-bs
08:59 🔗 dashcloud has joined #archiveteam-bs
09:31 🔗 GE has quit IRC (Quit: zzz)
09:31 🔗 GE has joined #archiveteam-bs
09:44 🔗 j08nY has joined #archiveteam-bs
09:58 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
10:32 🔗 schbirid has joined #archiveteam-bs
11:51 🔗 GE has quit IRC (Remote host closed the connection)
11:55 🔗 HCross2 I wouldn't touch anything Scaleway/Online.net as far as i could throw it
12:21 🔗 icedice has joined #archiveteam-bs
12:22 🔗 pizzaiolo has joined #archiveteam-bs
12:35 🔗 schbirid2 has joined #archiveteam-bs
12:38 🔗 schbirid has quit IRC (Read error: Operation timed out)
12:50 🔗 j08nY has quit IRC (Quit: Leaving)
13:33 🔗 odemg has joined #archiveteam-bs
14:05 🔗 icedice has quit IRC (Quit: Leaving)
14:07 🔗 pizzaiol1 has joined #archiveteam-bs
14:08 🔗 pizzaiolo has quit IRC (Ping timeout: 245 seconds)
14:20 🔗 SketchCow The guy who scanned Computer Gaming World contacted me and said "Hey, you know, I always put up the middle of the road resolutions. Want the full resolutions?"
14:20 🔗 SketchCow Yes, yes I do. He's uploading gigs of scans now.
14:21 🔗 SketchCow Also, I am happy to say that FOS is in the realm of getting normal
14:21 🔗 SketchCow Mostly because I made a second independent channel/drive for the IMDB nightmare
14:32 🔗 Roelandus has joined #archiveteam-bs
14:32 🔗 rocode Roelandus, have you already read through http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem ?
14:33 🔗 Roelandus yeah, I think so
14:33 🔗 rocode Are you just intending to read a warc file, or host a warc file?
14:33 🔗 Roelandus But I don't have the skills to use python
14:33 🔗 Roelandus I'm intending to read a .warc file
14:34 🔗 rocode Then I would recommend https://github.com/ikreymer/webarchiveplayer
14:34 🔗 Roelandus I'm using that program
14:34 🔗 Roelandus So I downloaded a test warc file from https://archive.org/details/testWARCfiles
14:35 🔗 Roelandus but it only shows me the links. When I click on those links it shows me a bunch of text instead of a web page
14:36 🔗 Roelandus Are you afk?
14:37 🔗 rocode I am looking into your issue, please be patient.
14:37 🔗 Roelandus Okay, thank you
14:40 🔗 Roelandus I have an image: https://gyazo.com/da2effe8c3e83612e7480987c69809a5
14:42 🔗 rocode You are opening the warc file, not the meta warc or cdx file?
14:42 🔗 rocode the 1gb file?
14:43 🔗 rocode I just tested it with one of my uploads: https://archive.org/download/allthingsrat.ditb.org-2017-03-01-6475a329/allthingsrat.ditb.org-2017-03-01-6475a329-00000.warc.gz and it works fine.
14:43 🔗 Roelandus the warc file, I downloaded the test torrent but it doesnt contain a .gz or .cdx file
14:44 🔗 rocode Try clicking on my link and using that file. It is only 53mb.
14:44 🔗 Roelandus awh shit, my download speed is fucking slow
14:44 🔗 Roelandus im at 5 MB :(
14:45 🔗 Roelandus 11 MB
14:46 🔗 Roelandus Why is this channel called archiveteam-bs (bullshit) seems a bit offensive to me
14:46 🔗 Roelandus 20.6 MB
14:47 🔗 Roelandus I have a 100 GB archive of a social-networking website on my PC it does have a .cdx but I can't open it with webarchiveplayer
14:48 🔗 Roelandus webarchiveplayer only opens .arc , .warc, .gz
14:49 🔗 rocode Because cdx is only a index of URLs that were grabbed.
14:49 🔗 rocode The warc file is the actual content.
14:49 🔗 Roelandus im almost done downloading your file
14:50 🔗 Roelandus alright
14:52 🔗 Roelandus that one seems to work
14:52 🔗 SketchCow sets mode: +o rocode
14:53 🔗 Roelandus yeah, your site works
14:53 🔗 Roelandus that test file is a bit weird though
14:53 🔗 Roelandus I have another question though
14:54 🔗 rocode Go for it. :)
14:54 🔗 Roelandus I got this 100GB .warc file which I mentioned previously. It takes forever to load it in.
14:55 🔗 Darkstar has quit IRC (Ping timeout: 370 seconds)
14:55 🔗 Roelandus Isn't there a way to just load the index URLs and then load in the indiv. pages
14:55 🔗 Roelandus I forgot the question mark
14:55 🔗 GE has joined #archiveteam-bs
14:55 🔗 rocode This unfortunately requires a tiny bit of that python you hated. Several tools exist to split warc files into smaller files. Warcat is the most popular. https://pypi.python.org/pypi/Warcat/
14:56 🔗 Roelandus I don't hate it, I just can't use it
14:56 🔗 Roelandus I have python installed on my pc
14:56 🔗 rocode You are on windows I am assuming?
14:57 🔗 Roelandus Windows 10, yes
14:58 🔗 Roelandus I have Python 3.5 installed
14:58 🔗 rocode In your bottom left Cortana search button, type "Powershell"
14:58 🔗 Roelandus cmds?
14:59 🔗 rocode Negatory. The program you need to launch is called "Windows Powershell"
14:59 🔗 Roelandus I already launched it
14:59 🔗 Roelandus I was asking for which commands I shall put in
15:00 🔗 rocode Get this script: https://bootstrap.pypa.io/get-pip.py
15:00 🔗 rocode Save it to your computer.
15:01 🔗 Roelandus woops
15:01 🔗 Roelandus already failed there
15:01 🔗 Roelandus how do I save it to my PC? just open notepad and paste the code in and save it as .py?
15:02 🔗 rocode Correct.
15:02 🔗 Roelandus done
15:04 🔗 rocode If you saved it in your home directory (i.e. C:\Users\<Username>\), and powershell shows you are in that directory, simply type python <name you saved the file as>
15:04 🔗 Roelandus I gotta cd to my desktop first
15:04 🔗 Roelandus I saved it there
15:04 🔗 rocode k
15:05 🔗 Roelandus I just moved it to that folder because that is the easiest way I think
15:06 🔗 Roelandus I got an error
15:06 🔗 Roelandus my input: "python code.py"
15:07 🔗 Roelandus output error: "python : The term 'python' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again. At line:1 char:1 + python code.py + ~~~~~~ + CategoryInfo : ObjectNotFound: (python:String) [], CommandNotFoundException + FullyQualifiedErrorId : CommandNotFoundException"
15:07 🔗 Roelandus I have python 3.5 32-bt
15:07 🔗 Roelandus *32-bit
15:07 🔗 Darkstar has joined #archiveteam-bs
15:08 🔗 VADemon hold on
15:08 🔗 Roelandus np
15:10 🔗 rocode Type this: $env:Path = "C:\Python35\";
15:11 🔗 VADemon https://dl.dropboxusercontent.com/u/53753604/screenshots/2017.03.12_16-08-29__explorer.png
15:11 🔗 VADemon hm yeah
15:11 🔗 rocode (If that is where your python is installed, please check first)
15:11 🔗 rocode VADemon, yeah, the installer includes the option to add to path, but it is off by default. :P
15:12 🔗 VADemon ...great
15:12 🔗 Roelandus Python35 is not in the C: folder
15:13 🔗 Roelandus Oh, I remember
15:13 🔗 Roelandus What was that environment table thing?
15:13 🔗 rocode $env:Path = "C:\Python35\"; replacing the stuff in the quotations with the actual path.
15:13 🔗 VADemon So you can access python from anywhere by just typing "python"
15:14 🔗 Roelandus I'll try to fix the environment table
15:15 🔗 Roelandus https://gyazo.com/b2f4b7888e956a6e694feab23e953f6e
15:16 🔗 Roelandus This is what it looks like with nothing changed
15:16 🔗 Roelandus I changed something in the past though
15:16 🔗 VADemon Did you enter rocod3's command though?
15:17 🔗 Roelandus No, because my python is located here "C:\Users\roels\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Python 3.5"
15:17 🔗 rocode No it isn't. Your start menu shortcut is located there.
15:17 🔗 VADemon This should be just the link to your installation
15:18 🔗 Roelandus woops
15:18 🔗 rocode Your actual install, if you left the options the same, should be C:\Python35
15:18 🔗 Roelandus If I do properties on that .exe it gives the path "C:\Users\roels\AppData\Local\Programs\Python\Python35-32\"
15:19 🔗 rocode You checked install for only me in the installer process then.
15:19 🔗 Roelandus probably
15:19 🔗 VADemon this should be it
15:19 🔗 rocode $env:Path = "C:\Users\roels\AppData\Local\Programs\Python\Python35-32\";
15:19 🔗 rocode Copy paste that into your powershell, then hit enter
15:20 🔗 Roelandus it shows nothing as output
15:20 🔗 Roelandus just a new line
15:20 🔗 Roelandus idk if that's good
15:20 🔗 rocode That's good.
15:20 🔗 rocode Now type python code.py
15:21 🔗 Roelandus yay
15:21 🔗 Roelandus oh shit
15:21 🔗 Roelandus it says it uninstalled pip :(
15:21 🔗 Roelandus it was already installed for some reason
15:22 🔗 rocode Standby
15:22 🔗 Roelandus now it's installed
15:22 🔗 Roelandus again
15:22 🔗 rocode Yep, it just upgraded your pip.
15:23 🔗 rocode Let it finish, then type pip install warcat
15:25 🔗 Roelandus error: "pip : The term 'pip' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again. At line:1 char:1 + pip install warcat + ~~~ + CategoryInfo : ObjectNotFound: (pip:String) [], CommandNotFoundException + FullyQualifiedErrorId : CommandNotFoundException"
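A likely reason for the error above: on Windows, `pip.exe` normally lives in the `Scripts` subfolder of the Python install, and the `$env:Path` command given earlier only named the install folder itself (it also replaced the whole search path rather than appending to it). The sketch below is not from the log; it shows how the running interpreter can report both folders, and how pip can be started through `python -m pip` without touching the path at all. The helper names are made up for illustration.

```python
import os
import sys
import sysconfig


def path_entries_for_python():
    """Return the two folders that usually need to be on PATH.

    The first holds the python executable, the second holds
    console scripts such as pip.
    """
    return [os.path.dirname(sys.executable), sysconfig.get_path("scripts")]


def pip_command(package):
    """Build a command that runs pip via the current interpreter.

    This avoids relying on a separate pip executable being on PATH.
    """
    return [sys.executable, "-m", "pip", "install", package]


if __name__ == "__main__":
    for entry in path_entries_for_python():
        print(entry)
    print(" ".join(pip_command("warcat")))
```

Typing `python -m pip install warcat` in the shell is the equivalent of the second helper.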
15:26 🔗 rocode Standby
15:29 🔗 rocode Roelandus, do you still have the Python3.5 installer in your downloads?
15:30 🔗 Roelandus I got some kind of folder named platform-tools
15:31 🔗 Roelandus should I download a new installer?
15:31 🔗 rocode Standby
15:31 🔗 rocode Download this: https://www.python.org/ftp/python/3.6.1/python-3.6.1rc1-amd64.exe
15:31 🔗 Roelandus downloaded, open>
15:31 🔗 Roelandus ?
15:31 🔗 rocode During the installation process, you will get a list of components being installed.
15:32 🔗 Roelandus so, I should install it?
15:32 🔗 rocode Negate previous, at the bottom of the installer is a checkbox that says "Add Python3.6 to path"
15:33 🔗 rocode Check it.
15:33 🔗 Roelandus customize installation or install now button?
15:33 🔗 rocode Once you have that checked, install now button
15:33 🔗 rocode Restart your powershell
15:33 🔗 rocode (close and reopen)
15:34 🔗 Roelandus it's still installing
15:34 🔗 rocode Once it finishes, at the last screen is "Disable path limits"
15:35 🔗 rocode Click it
15:35 🔗 Roelandus it's almost done
15:35 🔗 Roelandus I'm still on HDD
15:36 🔗 Roelandus I clicked it
15:36 🔗 Roelandus but it doesn't show a new install of some sort
15:36 🔗 Roelandus *install->installer
15:36 🔗 rocode that's fine
15:37 🔗 Roelandus okay, what's next? :)
15:37 🔗 rocode Open up powershell and type pip install warcat
15:38 🔗 Roelandus it's done
15:38 🔗 rocode Standby.
15:41 🔗 Roelandus what keyboard do you use btw?
15:41 🔗 rocode Type: python -m warcat split <warcfile>
15:41 🔗 rocode Replacing <warcfile> with the one you wish to split
15:42 🔗 Roelandus should I cd to my warcfile location first?
15:42 🔗 rocode Yes
15:42 🔗 rocode I use this due to severe wrist pain: https://www.trulyergonomic.com/store/truly-ergonomic-mechanical-keyboard-soft-tactile-kailh-cherry-mx-compatible-brown-keyswitches-227-english
15:43 🔗 Roelandus uhm it seems like cd'ing in cmd is not the same as cd'ing in ms powershell
15:43 🔗 Roelandus that keyboard is more expensive than hhkb
15:44 🔗 Roelandus I see the problem
15:46 🔗 rocode In the future, to bypass all the windows bullshit we just had to go through, you may want to consider installing the Windows Subsystem for Linux, which makes all of this 100x easier. https://msdn.microsoft.com/en-us/commandline/wsl/install_guide
15:46 🔗 rocode If you end up doing a lot of CLI work.
15:46 🔗 GE has quit IRC (Quit: zzz)
15:46 🔗 GE has joined #archiveteam-bs
15:47 🔗 Roelandus I'll save that link
15:48 🔗 Roelandus sorry I had a folder name that powershell responded to in the wrong way
15:48 🔗 Roelandus I had a folder named "test environment"
15:48 🔗 Roelandus It saw environment as a command of some sort
15:49 🔗 Roelandus it's splitting them
15:50 🔗 Roelandus not in the way I wanted to (wrong folder) but I can post fix that
15:51 🔗 Roelandus how can I check the progress in %
15:52 🔗 rocode Windows does not have a method of generating progress bars for CLI commands that do not have them built in like linux does.
15:52 🔗 rocode (This is my subtle conversion process)
15:53 🔗 Roelandus that truly ergonomic company is pretty expensive
15:53 🔗 Roelandus $65 numpad with cherry mx
15:54 🔗 rocode I am not saying it is for everyone. I use them because they solved my wrist pain. I know there are cheaper options. But you asked what I used, not what I recommended. ;)
15:54 🔗 Roelandus yeah
15:55 🔗 Roelandus what do you think of hhkb excluding the lite2?
15:57 🔗 Roelandus can I already open a split .warc file with webarchiveplayer?
15:57 🔗 rocode Yes
15:58 🔗 Roelandus Can you tell me your age?
15:59 🔗 rocode If you are considering topre keyboards, I can only point you to /r/MechanicalKeyboards . I may have an expensive keyboard, but that stuff is beyond my purview. :P However, if you are going to spend that kind of money, I recommend you also look at https://ergodox-ez.com
16:00 🔗 Roelandus I need a new keyboard but I'm probably gonna buy a membrane or a clone mechanical
16:00 🔗 Roelandus I'm just 16 y/o
16:00 🔗 Roelandus So I can't afford the finest mx clears or topre
16:01 🔗 rocode If money is an issue, I recommend what I had at your age: http://www.daskeyboard.com/products/mechanical-keyboards/
16:03 🔗 Roelandus daskeyboard is by no means cheap
16:03 🔗 Roelandus I got €1800 on my bank account
16:03 🔗 Roelandus And I'm in like the highest 5 percent in my school
16:03 🔗 rocode http://www.ebay.com/sch/i.html?_from=R40&_trksid=p2050601.m570.l1313.TR12.TRC2.A0.H0.Xibm+model+m.TRS0&_nkw=ibm+model+m&_sacat=0
16:04 🔗 rocode Find a cheap, used IBM Model M
16:04 🔗 Roelandus yeah, that's a good idea they're like €40
16:04 🔗 rocode A good keyboard, properly maintained, will last you 10 years.
16:04 🔗 Roelandus And I get dem geekhack respect with my model M
16:04 🔗 Roelandus but I have a problem with the split warc files
16:05 🔗 rocode Oh?
16:05 🔗 Roelandus https://gyazo.com/7eec1d45473e55f8bce45a908b34ca51
16:06 🔗 Roelandus when I click on a random file it shows basically nothing useful
16:06 🔗 Roelandus it's from the hyves.nl archive
16:10 🔗 rocode Roelandus, select multiple warc files with the webarchiveplayer
16:11 🔗 rocode Actually, I have committed a cardinal sin. What are you actually trying to accomplish with this warc file?
16:11 🔗 rocode Find a specific record?
16:11 🔗 rocode Or just randomly browse?
16:11 🔗 Roelandus hyves.nl
16:11 🔗 rocode I understand, but what are you trying to achieve?
16:11 🔗 Roelandus I need to find a couple of pages so I can grab their images/videos
16:12 🔗 Roelandus I already have the username/groupname list
16:13 🔗 rocode Roelandus, did you get this warc file from IA?
16:13 🔗 Roelandus yes I believe
16:14 🔗 Roelandus got it from here: http://www.archiveteam.org/index.php?title=Hyves but it points to internet archive I think
16:14 🔗 rocode I see. And you can't use the wayback machine because of their shitty robots.txt
16:14 🔗 Roelandus yeah, I don't get the robots.txt nonsense
16:14 🔗 rocode Unfortunately, you will need to use the built in searcher on the ENTIRE warc file, as slow as it is.
16:15 🔗 Roelandus The only thing I know is that it is used by search engines
16:15 🔗 rocode Jason Scott put it pretty well here: http://www.archiveteam.org/index.php?title=Robots.txt
16:16 🔗 Roelandus I already read half of that
16:17 🔗 Roelandus still I don't get the IA robots.txt
16:17 🔗 rocode IA follows robots.txt instructions. http://hyvesgames.nl/robots.txt is set up as a whitelist instead of a blacklist.
16:17 🔗 Roelandus I think the robots.txt works like this: search engine crawls -> sees robots.txt and acts accordingly
16:18 🔗 rocode IA follows robots.txt for several reasons, all of which I am not qualified to really elaborate on. It is the unfortunate reality of the situation.
16:18 🔗 Roelandus has it to do with legal stuff?
16:18 🔗 rocode Yes.
16:18 🔗 Roelandus yeah the website doesn't want their data to be 'stolen'
16:19 🔗 rocode Anyway, if you are looking for specific stuff within a warc, as far as I know, you will need to use the warc as a whole.
16:19 🔗 rocode Sorry.
16:19 🔗 Roelandus shit
16:19 🔗 Roelandus well...
16:19 🔗 rocode Anyway, I really should be getting back to work. If you need anything else, please feel free to ask in this channel. :)
16:19 🔗 Roelandus okay, but is there no way to speed up the loading progress?
16:20 🔗 Roelandus because I only want to see one page as a starting point
16:21 🔗 rocode As far as I know, no. Some others in this channel are way more experienced with actually using the actual warc file and may be able to help you, but we have reached the end of my expertise. :)
16:21 🔗 Roelandus okay, thanks for your time. Have a nice day sir :)
16:23 🔗 joepie91 Roelandus: you can grep through the index (CDX) file and then use the offsets to extract the corresponding WARC partially
16:23 🔗 joepie91 this will still take a good while though
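The "grep through the index" step can be sketched in Python as follows. This is only an illustration, not a tool anyone in the channel named. It assumes the CDX file starts with a header line such as ` CDX N b a m s k r M S V g`, where `a` is the original URL, `V` the compressed offset and `S` the compressed record size; other CDX files may use a different field set, so the column positions are read from the header rather than hard-coded. The function name and sample row are invented.

```python
def find_in_cdx(cdx_lines, needle):
    """Yield (url, offset, size) for rows whose original URL contains needle.

    cdx_lines is any iterable of text lines, e.g. an open file.
    size is None when the index has no usable 'S' column.
    """
    lines = iter(cdx_lines)
    # Header looks like " CDX N b a m s k r M S V g"; drop the "CDX" token.
    fields = next(lines).split()[1:]
    url_i = fields.index("a")
    off_i = fields.index("V")
    size_i = fields.index("S") if "S" in fields else None

    for line in lines:
        cols = line.split()
        if len(cols) != len(fields):
            continue  # skip malformed rows
        if needle not in cols[url_i] or not cols[off_i].isdigit():
            continue
        size = None
        if size_i is not None and cols[size_i].isdigit():
            size = int(cols[size_i])
        yield cols[url_i], int(cols[off_i]), size


if __name__ == "__main__":
    sample = [
        " CDX N b a m s k r M S V g",
        "nl,hyves)/foo 20131101000000 http://hyves.nl/foo text/html 200 "
        "ABC - - 1234 5678 hyves-00000.warc.gz",
    ]
    for hit in find_in_cdx(sample, "hyves.nl/foo"):
        print(hit)
```

Because the index is small compared with the WARC itself, scanning it this way avoids loading the 100 GB file just to locate one page.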
16:23 🔗 tapedrive has quit IRC (Read error: Operation timed out)
16:24 🔗 tapedrive has joined #archiveteam-bs
16:26 🔗 godane SketchCow: i have a collection of 321 Contact
16:26 🔗 godane i may be uploading that soon
16:27 🔗 Roelandus joepie91 nederlands?
16:29 🔗 godane Schools are encouraged to videotape the program, without fear of copyright violation, and to build the segments into the curriculum at convenient times. In the past, 26 countries have broadcast the series. West Germany, France and Spain have produced their own versions with native casts.
16:29 🔗 godane source about 321 Contact: http://www.nytimes.com/1984/10/02/science/about-education-forgotten-tv-audience-children.html
16:30 🔗 godane i hope that will be enough so we don't have to take collection down
16:42 🔗 Roelandus joepie91
16:57 🔗 joepie91 Roelandus: yes, but the default language here is English :)
16:57 🔗 Roelandus Can you help me?
16:58 🔗 joepie91 Roelandus: well, see my advice above - there's not really a faster way to do this, and I don't have the time atm to go through it step by step
17:00 🔗 Roelandus is there a specific program to partially extract the warc by the .cdx offset?
17:05 🔗 whydomain has quit IRC (Read error: Operation timed out)
17:06 🔗 whydomain has joined #archiveteam-bs
17:18 🔗 icedice has joined #archiveteam-bs
17:22 🔗 odemg has quit IRC (Remote host closed the connection)
17:40 🔗 whydomain has quit IRC (Read error: Operation timed out)
17:41 🔗 whydomain has joined #archiveteam-bs
17:43 🔗 kyounko has quit IRC (Read error: Connection reset by peer)
17:46 🔗 Roelandus How do I install wikiteam?
18:00 🔗 whydomain has quit IRC (Read error: Operation timed out)
18:01 🔗 whydomain has joined #archiveteam-bs
18:08 🔗 rocode Roelandus, http://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior
18:48 🔗 RichardG_ has joined #archiveteam-bs
18:49 🔗 RichardG has quit IRC (Ping timeout: 260 seconds)
18:53 🔗 RichardG has joined #archiveteam-bs
18:56 🔗 odemg has joined #archiveteam-bs
18:56 🔗 odemg has quit IRC (Connection closed)
18:58 🔗 RichardG_ has quit IRC (Read error: Operation timed out)
19:07 🔗 spiko has quit IRC (Read error: Connection reset by peer)
19:12 🔗 bwn has quit IRC (Ping timeout: 244 seconds)
19:18 🔗 Aranje has joined #archiveteam-bs
19:34 🔗 j08nY has joined #archiveteam-bs
19:42 🔗 joepie91 [18:00] <Roelandus> is there a specific program to partially extract the warc by the .cdx offset?
19:42 🔗 joepie91 there might be, but warc files are just big gzipped balls of text
19:42 🔗 joepie91 so if you grab the offset, you can just read the specified segment from the file (using `head` and `tail` maybe?) and ungzip that
19:42 🔗 joepie91 and you get text and headers
19:43 🔗 joepie91 and the resource
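As an alternative to `head` and `tail`, the same idea (seek to the offset, read that segment, ungzip it) can be sketched in Python. This is a hedged example, not a tested recipe from the channel: it assumes each WARC record was gzipped as its own member, and the function name is made up. The offset and optional length would come from the CDX index.

```python
import zlib


def read_record(path, offset, length=None):
    """Decompress one gzip member starting at byte `offset` in `path`.

    If `length` (the compressed size) is given, at most that many bytes
    are read; otherwise reading stops when the gzip member ends.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    parts = []
    remaining = length
    with open(path, "rb") as handle:
        handle.seek(offset)
        while not decomp.eof:
            want = 65536 if remaining is None else min(65536, remaining)
            if want == 0:
                break
            chunk = handle.read(want)
            if not chunk:
                break
            if remaining is not None:
                remaining -= len(chunk)
            parts.append(decomp.decompress(chunk))
    return b"".join(parts)
```

The returned bytes are the WARC headers, then the HTTP headers, then the resource body, as described above.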
19:47 🔗 Roelandus I'll probably let my server(old dell/hp pc) extract it overnight
19:47 🔗 Roelandus Because my PC is in my bedroom
19:49 🔗 Roelandus This would be a lot easier if the robots.txt from IA wouldn't block me
19:50 🔗 bwn has joined #archiveteam-bs
20:00 🔗 bwn Roelandus: were you able to get what you were after?
20:01 🔗 yipdw three
20:01 🔗 rocode Roelandus, IA is not supplying the robots.txt. The website you are trying to access is.
20:01 🔗 yipdw there's only a couple of directives in https://archive.org/robots.txt so I dunno what's up with that
20:01 🔗 yipdw or see above
20:01 🔗 rocode IA is just following the request of the website owner (as misguided as it may be.)
20:02 🔗 HCross2 I do wonder what caused that item to be added to the archive.org robots.txt
20:03 🔗 rocode Probably a legal order if I had to guess.
20:05 🔗 yipdw as far as extraction from an offset goes, yeah it's possible. there might be better tools for this, but one way to do it is to pull offsets out of the cdxes and issue range requests
20:05 🔗 yipdw https://gitlab.peach-bun.com/snippets/34
20:06 🔗 yipdw this only works if each record is compressed individually but that's what we do anyway
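The range-request approach can be sketched like this. It is not the linked snippet, whose contents are not shown in the log; it is a minimal illustration using only the Python standard library. The function names and the example URL are hypothetical, and the server must honour HTTP `Range` headers for it to work.

```python
import urllib.request
import zlib


def range_header(offset, size):
    """Return the Range header value for `size` bytes starting at `offset`."""
    # HTTP byte ranges are inclusive on both ends.
    return "bytes={}-{}".format(offset, offset + size - 1)


def build_request(url, offset, size):
    """Build a request for one compressed record of a remote WARC."""
    return urllib.request.Request(url, headers={"Range": range_header(offset, size)})


def fetch_record(url, offset, size):
    """Download one record and ungzip it (needs network access)."""
    with urllib.request.urlopen(build_request(url, offset, size)) as response:
        compressed = response.read()
    # 16 + MAX_WBITS makes zlib accept the gzip wrapper.
    return zlib.decompress(compressed, 16 + zlib.MAX_WBITS)


if __name__ == "__main__":
    # Hypothetical URL; offset and size would come from the CDX index.
    req = build_request("https://example.org/some.warc.gz", 5678, 1234)
    print(req.get_header("Range"))
```

This only downloads the bytes for the one record, so the full WARC never has to be stored locally.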
20:08 🔗 bwn i've had success using warctools' warcfilter utility to grab specific urls from a warc: https://github.com/internetarchive/warctools
20:11 🔗 odemg has joined #archiveteam-bs
20:12 🔗 yipdw does warctools support retrieving only a small portion of a large WARC? from https://github.com/internetarchive/warctools/blob/master/hanzo/warctools/stream.py, it's not clear to me if it does
20:42 🔗 phuzion has quit IRC (Read error: Operation timed out)
20:46 🔗 phuzion has joined #archiveteam-bs
20:52 🔗 Roelandus has quit IRC (Ping timeout: 268 seconds)
20:54 🔗 phuzion has quit IRC (Read error: Operation timed out)
20:57 🔗 phuzion has joined #archiveteam-bs
21:19 🔗 odemg has quit IRC (Remote host closed the connection)
21:22 🔗 Honno_ has quit IRC (Ping timeout: 370 seconds)
21:31 🔗 kristian_ has joined #archiveteam-bs
21:43 🔗 odemg has joined #archiveteam-bs
21:46 🔗 icedice has quit IRC (Quit: Leaving)
21:47 🔗 odemg has quit IRC (Remote host closed the connection)
21:56 🔗 BlueMaxim has joined #archiveteam-bs
22:06 🔗 kristian_ has quit IRC (Quit: Leaving)
22:12 🔗 no1spod has quit IRC (Quit: reboot)
22:31 🔗 Roelandus has joined #archiveteam-bs
22:33 🔗 odemg has joined #archiveteam-bs
22:38 🔗 Roelandus hello
22:44 🔗 GE has quit IRC (Ping timeout: 255 seconds)
22:45 🔗 GE has joined #archiveteam-bs
23:07 🔗 amiiboh has joined #archiveteam-bs
23:09 🔗 t2t2 has quit IRC (Read error: Operation timed out)
23:10 🔗 t2t2 has joined #archiveteam-bs
23:31 🔗 amiiboh has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
23:54 🔗 schbirid2 has quit IRC (Quit: Leaving)
23:59 🔗 Stiletto has joined #archiveteam-bs
