Time |
Nickname |
Message |
00:08
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
00:11
🔗
|
|
Sk1d has joined #archiveteam-bs |
00:24
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
00:27
🔗
|
|
Sk1d has joined #archiveteam-bs |
00:40
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
00:45
🔗
|
|
Sk1d has joined #archiveteam-bs |
00:46
🔗
|
kiska |
It's not in my lists |
00:57
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
01:02
🔗
|
|
Sk1d has joined #archiveteam-bs |
01:13
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
01:18
🔗
|
|
Sk1d has joined #archiveteam-bs |
01:23
🔗
|
|
VerifiedJ has quit IRC (Quit: Leaving) |
01:30
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
01:35
🔗
|
|
Sk1d has joined #archiveteam-bs |
01:39
🔗
|
Flashfire |
Doomtay |
01:51
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
01:51
🔗
|
|
balrog has quit IRC (Quit: Bye) |
01:56
🔗
|
|
Sk1d has joined #archiveteam-bs |
01:56
🔗
|
|
balrog has joined #archiveteam-bs |
01:56
🔗
|
|
swebb sets mode: +o balrog |
02:07
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
02:11
🔗
|
|
Sk1d has joined #archiveteam-bs |
03:16
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
03:19
🔗
|
|
Sk1d has joined #archiveteam-bs |
03:35
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
03:37
🔗
|
|
Sk1d has joined #archiveteam-bs |
03:51
🔗
|
|
K4k has quit IRC (Read error: Connection reset by peer) |
03:58
🔗
|
|
m007a83_ is now known as m007a83 |
04:05
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
04:10
🔗
|
|
Sk1d has joined #archiveteam-bs |
04:42
🔗
|
|
qw3rty116 has joined #archiveteam-bs |
04:43
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
04:46
🔗
|
|
Sk1d has joined #archiveteam-bs |
04:47
🔗
|
|
qw3rty115 has quit IRC (Read error: Operation timed out) |
04:55
🔗
|
|
odemg has quit IRC (Ping timeout: 246 seconds) |
04:59
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
05:03
🔗
|
|
Sk1d has joined #archiveteam-bs |
05:05
🔗
|
|
zerkalo_ has quit IRC (Read error: Operation timed out) |
05:07
🔗
|
|
zerkalo has joined #archiveteam-bs |
05:09
🔗
|
|
jspiros has quit IRC (Read error: Operation timed out) |
05:09
🔗
|
|
sknebel_ has quit IRC (Read error: Operation timed out) |
05:10
🔗
|
|
Petri152 has quit IRC (Ping timeout: 246 seconds) |
05:11
🔗
|
|
joepie91 has quit IRC (Ping timeout: 246 seconds) |
05:11
🔗
|
|
odemg has joined #archiveteam-bs |
05:11
🔗
|
|
sknebel has joined #archiveteam-bs |
05:11
🔗
|
|
Coderjo has quit IRC (Ping timeout: 246 seconds) |
05:12
🔗
|
|
JAA has quit IRC (Ping timeout: 246 seconds) |
05:12
🔗
|
|
zyphlar has quit IRC (Ping timeout: 246 seconds) |
05:12
🔗
|
|
svchfoo1 has quit IRC (Ping timeout: 246 seconds) |
05:13
🔗
|
|
Coderjo has joined #archiveteam-bs |
05:15
🔗
|
|
c4rc4s has quit IRC (Read error: Operation timed out) |
05:15
🔗
|
|
Mayonaise has quit IRC (Ping timeout: 492 seconds) |
05:18
🔗
|
|
Muad-Dib has quit IRC (Ping timeout: 260 seconds) |
05:19
🔗
|
|
joepie91 has joined #archiveteam-bs |
05:20
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
05:22
🔗
|
|
Sk1d has joined #archiveteam-bs |
05:23
🔗
|
|
godane has quit IRC (Ping timeout: 252 seconds) |
05:25
🔗
|
|
Mayonaise has joined #archiveteam-bs |
05:29
🔗
|
|
Muad-Dib has joined #archiveteam-bs |
05:39
🔗
|
|
kidneybea has joined #archiveteam-bs |
05:53
🔗
|
|
kyounko has joined #archiveteam-bs |
06:07
🔗
|
|
godane has joined #archiveteam-bs |
06:10
🔗
|
|
svchfoo1 has joined #archiveteam-bs |
06:10
🔗
|
|
c4rc4s has joined #archiveteam-bs |
06:10
🔗
|
|
Petri152 has joined #archiveteam-bs |
06:10
🔗
|
|
zyphlar has joined #archiveteam-bs |
06:10
🔗
|
Ryz |
In regards to attempting to archive Lenny Letter - https://www.lennyletter.com/ - this one has an infinite scrolling feature, but you have to click on "Load More Stories" to get more of the articles |
06:10
🔗
|
|
svchfoo3 sets mode: +o svchfoo1 |
06:11
🔗
|
|
JAA has joined #archiveteam-bs |
06:11
🔗
|
|
swebb sets mode: +o JAA |
06:11
🔗
|
|
bakJAA sets mode: +o JAA |
06:13
🔗
|
|
jspiros has joined #archiveteam-bs |
06:24
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
06:25
🔗
|
|
zerkalo has quit IRC (Remote host closed the connection) |
06:28
🔗
|
|
Sk1d has joined #archiveteam-bs |
06:29
🔗
|
jodizzle |
Ryz: Seems like it has a detailed sitemap.xml: https://www.lennyletter.com/sitemap.xml |
06:29
🔗
|
jodizzle |
Which points to other XML files which seem to point to stories: https://www.lennyletter.com/sitemap.xml?year=2015&month=10&week=2 |
06:30
🔗
|
Flashfire |
is this a job for archivebot? |
06:30
🔗
|
Ryz |
Sadly, I have no idea how to extract the links in a way that's not cumbersome |
06:30
🔗
|
|
Atom-- has joined #archiveteam-bs |
06:30
🔗
|
Ryz |
I'm not sure if ArchiveBot can handle it if it has infinite scrolling |
06:31
🔗
|
Flashfire |
I gave it a try |
06:31
🔗
|
jodizzle |
Well it doesn't really if you can get the links from the XML files |
06:31
🔗
|
jodizzle |
Not sure if archiveboot will pick those up, but some crawlers probably will. |
06:31
🔗
|
jodizzle |
s/boot/bot |
06:33
🔗
|
|
Atom__ has quit IRC (Ping timeout: 252 seconds) |
06:34
🔗
|
|
Atom__ has joined #archiveteam-bs |
06:36
🔗
|
|
Atom-- has quit IRC (Ping timeout: 252 seconds) |
06:40
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
06:45
🔗
|
|
Sk1d has joined #archiveteam-bs |
06:57
🔗
|
jodizzle |
Ryz, Flashfire: Here's a list of links to posts extracted from the sitemaps, if it helps: https://transfer.sh/10HhxI/lenny-letter-list.txt |
06:57
🔗
|
jodizzle |
(Mostly I wanted a chance to play with shell commands.) |
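[Editor's sketch] The sitemap walk jodizzle describes (a top-level sitemap.xml that points to further XML files, which in turn point to stories) could look roughly like this in Python. This is not the shell pipeline that was actually used; it is a minimal sketch assuming the files follow the standard sitemaps.org schema. The helper names `extract_locs` and `collect_post_urls` and the `fetch` callback are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (assumed to be what the site uses).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_locs(xml_text):
    """Return (kind, urls): kind is 'index' or 'urlset', urls are the <loc> values."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == SITEMAP_NS + "sitemapindex" else "urlset"
    urls = [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
    return kind, urls


def collect_post_urls(top_xml, fetch):
    """Walk a sitemap index one level deep.

    `fetch` is a hypothetical callback that maps a sitemap URL to its XML text,
    for example a thin wrapper around urllib.request.urlopen.
    """
    kind, urls = extract_locs(top_xml)
    if kind == "urlset":
        return urls
    posts = []
    for child_url in urls:
        posts.extend(extract_locs(fetch(child_url))[1])
    # Deduplicate while keeping first-seen order.
    return list(dict.fromkeys(posts))
```

Writing the returned list one URL per line gives a plain text file of the kind shared above.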
07:08
🔗
|
Ryz |
Awesome, thanks jodizzle - let's hope that's all of the articles |
07:08
🔗
|
Ryz |
The lack of a timestamp on the URLs themselves makes this tricky |
07:10
🔗
|
|
kidneybea has quit IRC (Quit: Page closed) |
07:29
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
07:31
🔗
|
|
Sk1d has joined #archiveteam-bs |
07:44
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
07:48
🔗
|
|
Sk1d has joined #archiveteam-bs |
07:52
🔗
|
jodizzle |
Ryz: The sitemap files are timestamped with year, month and week, so we can at least know that much. |
07:53
🔗
|
jodizzle |
If you need it for some reason. |
08:01
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
08:04
🔗
|
|
Sk1d has joined #archiveteam-bs |
08:08
🔗
|
jodizzle |
Ryz, Flashfire: Here's Lenny Letter social media after snscraping, if you wanna throw this in archivebot: https://transfer.sh/169FA3/lenny-letter-facebook.txt https://transfer.sh/4THNf/lenny-letter-twitter.txt https://transfer.sh/wuEhd/lenny-letter-instagram.txt |
08:09
🔗
|
jodizzle |
And here's the youtube if anyone wants to get it: https://www.youtube.com/channel/UCDfky0ey-Gb6SqoF3wRhHWQ |
08:16
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
08:19
🔗
|
|
godane has quit IRC (Ping timeout: 268 seconds) |
08:19
🔗
|
|
Sk1d has joined #archiveteam-bs |
08:23
🔗
|
Ryz |
Now what to do with USA Today's Overwatch Wire website - https://overwatchwire.usatoday.com/ |
08:26
🔗
|
jodizzle |
I can snscrape the social media links. |
08:33
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
08:35
🔗
|
jodizzle |
Ryz: https://transfer.sh/QK5ZD/overwatchwire-twitter.txt https://transfer.sh/jTjhU/overwatchwire-facebook.txt |
08:36
🔗
|
|
Sk1d has joined #archiveteam-bs |
08:40
🔗
|
Ryz |
!status bc7mt8vl9cxi1t16gq99o8imr |
08:41
🔗
|
Ryz |
Whoops :p |
08:48
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
08:53
🔗
|
|
Sk1d has joined #archiveteam-bs |
09:01
🔗
|
|
Petri152 has quit IRC (Read error: Operation timed out) |
09:01
🔗
|
|
zyphlar has quit IRC (Read error: Operation timed out) |
09:01
🔗
|
|
JAA has quit IRC (Read error: Operation timed out) |
09:02
🔗
|
|
svchfoo1 has quit IRC (Read error: Operation timed out) |
09:02
🔗
|
|
jspiros has quit IRC (Read error: Operation timed out) |
09:03
🔗
|
|
c4rc4s has quit IRC (Read error: Operation timed out) |
09:12
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
09:13
🔗
|
|
achip has quit IRC (Read error: Operation timed out) |
09:14
🔗
|
|
Sk1d has joined #archiveteam-bs |
09:16
🔗
|
Ryz |
And thanks for the .txt files jodizzle once more |
09:17
🔗
|
|
achip has joined #archiveteam-bs |
10:00
🔗
|
|
jspiros has joined #archiveteam-bs |
10:00
🔗
|
|
zyphlar has joined #archiveteam-bs |
10:00
🔗
|
|
svchfoo1 has joined #archiveteam-bs |
10:01
🔗
|
|
c4rc4s has joined #archiveteam-bs |
10:01
🔗
|
|
Petri152 has joined #archiveteam-bs |
10:01
🔗
|
|
svchfoo3 sets mode: +o svchfoo1 |
10:02
🔗
|
|
JAA has joined #archiveteam-bs |
10:02
🔗
|
|
swebb sets mode: +o JAA |
10:02
🔗
|
|
bakJAA sets mode: +o JAA |
10:05
🔗
|
Ryz |
Hmm, unsure if such an article has been shared here before, but it could spark an interest in archiving Chinese websites: https://qz.com/1166024/china-shut-down-over-13000-illegal-websites-and-10-million-user-accounts-since-2015/ |
10:06
🔗
|
|
BlueMax has quit IRC (Read error: Connection reset by peer) |
10:16
🔗
|
Ryz |
Not sure if The Debrief website is still accessible or not - https://thedebrief.co.uk/ - the website unfortunately coughs out the explicit message of "Your connection is not private" |
10:19
🔗
|
Ryz |
A bit of an eye-opener that The Debrief doesn't have a Wikipedia article yet |
10:53
🔗
|
JAA |
jodizzle, Ryz: ArchiveBot does parse sitemaps. It won't do the infinite scrolling though. |
11:30
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
11:34
🔗
|
|
Sk1d has joined #archiveteam-bs |
11:49
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
11:53
🔗
|
|
Sk1d has joined #archiveteam-bs |
12:08
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
12:08
🔗
|
|
zerkalo has joined #archiveteam-bs |
12:11
🔗
|
|
Sk1d has joined #archiveteam-bs |
12:22
🔗
|
|
godane has joined #archiveteam-bs |
12:23
🔗
|
|
godane has quit IRC (Client Quit) |
12:29
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
12:32
🔗
|
|
Sk1d has joined #archiveteam-bs |
12:36
🔗
|
|
godane has joined #archiveteam-bs |
12:42
🔗
|
|
Ryz has quit IRC (Quit: ChatZilla 0.9.92-rdmsoft [XULRunner 35.0.1/20150122214805]) |
12:45
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
12:45
🔗
|
|
twoTBHetz has quit IRC (Remote host closed the connection) |
12:50
🔗
|
|
Sk1d has joined #archiveteam-bs |
13:02
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
13:04
🔗
|
|
Sk1d has joined #archiveteam-bs |
13:17
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
13:17
🔗
|
|
Pixi` has quit IRC (Quit: Pixi`) |
13:20
🔗
|
|
Sk1d has joined #archiveteam-bs |
13:22
🔗
|
|
twoTBHetz has joined #archiveteam-bs |
13:33
🔗
|
VADemon |
Ryz if you read logs: thedebrief.co.uk is not accessible, it redirects to another website and has a bad SSL config that spews errors in browser |
13:34
🔗
|
|
Pixi has joined #archiveteam-bs |
13:35
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
13:40
🔗
|
|
Sk1d has joined #archiveteam-bs |
13:53
🔗
|
|
VerifiedJ has joined #archiveteam-bs |
14:05
🔗
|
|
Mayonaise has quit IRC (Read error: Operation timed out) |
14:06
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
14:08
🔗
|
|
Sk1d has joined #archiveteam-bs |
14:12
🔗
|
|
wp494 has quit IRC (Ping timeout: 260 seconds) |
14:13
🔗
|
|
wp494 has joined #archiveteam-bs |
14:22
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
14:27
🔗
|
|
Sk1d has joined #archiveteam-bs |
14:35
🔗
|
|
Mayonaise has joined #archiveteam-bs |
14:41
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
14:43
🔗
|
|
Sk1d has joined #archiveteam-bs |
15:02
🔗
|
|
VerifiedJ has quit IRC (Ping timeout: 252 seconds) |
15:11
🔗
|
|
Hiccup has joined #archiveteam-bs |
15:12
🔗
|
|
VerifiedJ has joined #archiveteam-bs |
15:13
🔗
|
Hiccup |
Anyone have any idea what might cause this pywb issue?: https://github.com/webrecorder/pywb/issues/406 |
15:13
🔗
|
Hiccup |
Or is pywb not a recommended "complicated website" recorder? |
15:13
🔗
|
Hiccup |
Also, does pywb work if you have to use special user agents and client certificates to access the website? |
15:32
🔗
|
|
VerifiedJ has quit IRC (Ping timeout: 252 seconds) |
15:36
🔗
|
JAA |
Hiccup: UAs are not a problem, but client certificates will likely be one. |
15:37
🔗
|
Hiccup |
Any idea how I can overcome the problem? |
15:37
🔗
|
Hiccup |
(btw the issue with pywb I posted above is probably unrelated as I tried it on normal websites too) |
15:42
🔗
|
JAA |
Hiccup: I think you said before that you need a proper browser as well, right? In that case, it'll probably be very hard. |
15:42
🔗
|
Hiccup |
well it doesn't need to be a proper browser |
15:42
🔗
|
Hiccup |
it can be a very basic browser |
15:43
🔗
|
|
VerifiedJ has joined #archiveteam-bs |
15:43
🔗
|
Hiccup |
might not even need JS |
15:43
🔗
|
JAA |
Oh |
15:43
🔗
|
JAA |
In that case, try wpull. |
15:43
🔗
|
JAA |
You can specify a client certificate with --certificate and the related options. |
15:44
🔗
|
Hiccup |
So that's something that will archive urls you feed it or work recursively? |
15:45
🔗
|
Hiccup |
That would be okay, except the website I want to archive doesn't actually use proper URLs |
15:45
🔗
|
JAA |
Can do both. |
15:45
🔗
|
Hiccup |
I'd need it to be recursive, but it would essentially need to grep the html for urls that are in custom tags |
15:45
🔗
|
Hiccup |
(this website is intended to be used with a custom browser) |
15:46
🔗
|
JAA |
In that case, you might have to use the get_urls hook to extract those additional URLs. |
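[Editor's sketch] The extraction step discussed here, pulling URLs out of non-standard tags so a crawler can queue them, might look like the following in Python. The tag and attribute names (`navlink`, `target`) are made up for illustration, since the log does not say what the site uses, and wiring the result into wpull's get_urls hook is deliberately not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CustomTagLinkParser(HTMLParser):
    """Collect URLs from attributes of non-standard tags."""

    def __init__(self, base_url, wanted):
        super().__init__()
        self.base_url = base_url
        # Mapping of tag name -> attribute holding the URL (hypothetical names).
        self.wanted = wanted
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        attr_name = self.wanted.get(tag)
        if attr_name is None:
            return
        for name, value in attrs:
            if name == attr_name and value:
                # Resolve relative references against the page URL.
                self.urls.append(urljoin(self.base_url, value))


def extract_custom_urls(html_text, base_url, wanted):
    parser = CustomTagLinkParser(base_url, wanted)
    parser.feed(html_text)
    parser.close()
    return parser.urls
```

A call such as `extract_custom_urls(page, page_url, {"navlink": "target"})` returns absolute URLs that could then be handed to whichever crawler is in use.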
16:14
🔗
|
Hiccup |
Actually |
16:14
🔗
|
Hiccup |
just recursive searching won't be enough. |
16:14
🔗
|
Hiccup |
There will need to be a manual element to it... |
16:14
🔗
|
Hiccup |
Are there any tools similar to pywb, that support client certs? |
16:29
🔗
|
moufu_ |
there's warcprox, but I'm not sure if it works with custom client certs |
16:29
🔗
|
|
moufu_ is now known as moufu |
16:33
🔗
|
JAA |
Yeah, I don't think it does since it opens its own SSL connection. |
16:33
🔗
|
JAA |
Same with all the other MITM proxies. |
16:38
🔗
|
Hiccup |
I may actually be able to use wget. I can just browse the website normally, noting URLs (and their likely variations) down, then run them all through wget. Then I can grep those files for any extra URLs. |
16:39
🔗
|
|
VerifiedJ has quit IRC (Ping timeout: 252 seconds) |
16:39
🔗
|
Hiccup |
I don't think there's any need to use WARC. I've set up wget to record all the HTTP headers to a file, so having that + the files is basically going to be everything useful, I think. |
16:40
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
16:45
🔗
|
|
Sk1d has joined #archiveteam-bs |
16:50
🔗
|
VADemon |
WARC is the only way to get accepted into Wayback afaik; everything else is disregarded as not legit, Hiccup |
16:50
🔗
|
Hiccup |
yeah |
16:51
🔗
|
astrid |
warcs are the feedstock of wayback, but they have to be blessed because IA doesn't want to publish history that is false |
16:51
🔗
|
VADemon |
that adds to it ^ |
16:51
🔗
|
Hiccup |
I'm not sure if this could go into wayback anyway, because of the useragent+clientcerts |
16:54
🔗
|
moufu |
wget can output warcs too |
16:58
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
17:02
🔗
|
|
Sk1d has joined #archiveteam-bs |
17:11
🔗
|
Hiccup |
true |
17:11
🔗
|
Hiccup |
but in an old format I think |
17:11
🔗
|
Hiccup |
I'm not sure how IA could check if WARCs are genuine |
17:11
🔗
|
Hiccup |
WARCs from any random person like me |
17:18
🔗
|
astrid |
there isn't a technical solution to this social problem |
17:22
🔗
|
Hiccup |
you are probably right about that |
17:22
🔗
|
Hiccup |
it's just a trust thing |
17:22
🔗
|
Hiccup |
websites aren't "self-verifying" or anything |
17:24
🔗
|
arkiver |
do we want a warrior job for hostinger? |
17:24
🔗
|
arkiver |
just got the ping from betamax |
17:25
🔗
|
|
Hiccup has quit IRC (Remote host closed the connection) |
17:25
🔗
|
arkiver |
we can look into that when we have a good list |
17:26
🔗
|
twoTBHetz |
i am currently working on a larger list for hostinger; what we have so far is fine though |
17:26
🔗
|
twoTBHetz |
what is the problem with my current list? |
17:29
🔗
|
twoTBHetz |
(which was at https://pastebin.com/ANJQcSut ) |
17:29
🔗
|
twoTBHetz |
arkiver ? |
17:41
🔗
|
|
VerifiedJ has joined #archiveteam-bs |
18:21
🔗
|
|
ndiddy has joined #archiveteam-bs |
18:29
🔗
|
|
Martle_ has joined #archiveteam-bs |
18:29
🔗
|
|
Martle_ has quit IRC (Remote host closed the connection) |
18:30
🔗
|
|
Martle_ has joined #archiveteam-bs |
18:35
🔗
|
|
Martle has quit IRC (Read error: Operation timed out) |
18:46
🔗
|
twoTBHetz |
mhh, i am now checking which websites have the deprecated header; otherwise you might have a problem with 200000 random sites which are not under threat at all |
18:52
🔗
|
|
Mateon1 has quit IRC (Ping timeout: 265 seconds) |
18:53
🔗
|
|
Mateon1 has joined #archiveteam-bs |
19:24
🔗
|
PurpleSym |
VoynichCr: This is your bot, right? ↑ |
19:25
🔗
|
|
ndiddy has quit IRC () |
19:26
🔗
|
VoynichCr |
PurpleSym: yes |
19:27
🔗
|
VoynichCr |
well, those were my edits; my bot signs its edits with "BOT" in the summary |
19:27
🔗
|
VoynichCr |
those were hand-made |
19:32
🔗
|
twoTBHetz |
would 1-16 Mbit/s of requests be considered a threat? |
19:33
🔗
|
PurpleSym |
VoynichCr: I added that info to your userpage. |
19:34
🔗
|
PurpleSym |
Shall we create a category for pages your bot updates? |
19:44
🔗
|
schbirid |
twoTBHetz: bandwidth is not a useful metric unless it is purely static files |
19:47
🔗
|
|
dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.) |
19:49
🔗
|
|
dashcloud has joined #archiveteam-bs |
19:49
🔗
|
|
Ryz has joined #archiveteam-bs |
20:04
🔗
|
VoynichCr |
PurpleSym: i created this https://www.archiveteam.org/index.php?title=Category:ArchiveBot |
20:08
🔗
|
PurpleSym |
This does not include the lists Death/Disestablishments in X though. I just thought it would be nice if there was a way to list all auto-updated pages. |
20:15
🔗
|
|
VADemon_ has joined #archiveteam-bs |
20:19
🔗
|
|
VADemon has quit IRC (Ping timeout: 255 seconds) |
20:25
🔗
|
betamax |
twoTBHetz: depending on the size of the lists, I would perhaps be inclined to get all the URLs regardless of whether or not they have the 'closing' banner |
20:26
🔗
|
betamax |
I know that some hostinger sites (myself included, before I moved off them) added bits of JS to prevent hostinger injecting extra bits into the site |
20:27
🔗
|
betamax |
so the fact that there is not a banner does not mean the site is going to stick around |
20:27
🔗
|
betamax |
I also don't trust hostinger, and getting a complete grab while we have the chance seems a good idea |
20:27
🔗
|
twoTBHetz |
mhh there are 15 uninteresting urls for each interesting one |
20:28
🔗
|
twoTBHetz |
the banner is in the html and my client executes no js, so the banner will be there |
20:28
🔗
|
betamax |
ah, OK. |
20:29
🔗
|
betamax |
but could you keep a list of all the non-banner sites as well, as maybe what could happen is we could go after the at-risk ones first, then do the less-at-risk-but-still-Hostinger ones after |
20:31
🔗
|
twoTBHetz |
I am now at 58219 of 226717, taken from a reasonable selection out of the huge json i got |
20:31
🔗
|
twoTBHetz |
i kept the 200MB json with 1372060 lines and lots of duplication |
20:32
🔗
|
twoTBHetz |
and irrelevant hostinger-clone urls like javahostinger |
20:32
🔗
|
betamax |
what do you mean by "reasonable selection"? |
20:32
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
20:33
🔗
|
twoTBHetz |
everything that has a cpanel will not be free hosting |
20:34
🔗
|
betamax |
ah, so how did you work out which ones were free hosting? |
20:34
🔗
|
betamax |
or am I misunderstanding things? |
20:36
🔗
|
twoTBHetz |
i saw that a reasonable portion also had cpanel.dom.tld domains, and i removed both cpanel.dom.tld and dom.tld since those are not going to be free hosting |
20:37
🔗
|
|
Sk1d has joined #archiveteam-bs |
20:38
🔗
|
betamax |
oh, I get it - thanks for the clarification |
20:38
🔗
|
twoTBHetz |
my original dataset this time was dns and there were lots of dns level redirections to other hosters. |
20:39
🔗
|
twoTBHetz |
i removed redirections to yahoo, google, stuff with "hosting-" in it, but i still kept the original dns collection |
20:41
🔗
|
twoTBHetz |
or "ovh", "neko", "systems", "yaya", "natro" ... as substring |
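[Editor's sketch] The two filtering steps twoTBHetz describes, dropping any domain that also has a cpanel.dom.tld entry and dropping entries whose DNS redirection points at another hoster, could be expressed like this in Python. The actual tooling is not shown in the log, so the function names and the input shapes (a list of hostnames, and a hostname -> DNS target mapping) are assumptions. The substring list holds only the values named in the chat; the original list was longer.

```python
# Substrings named in the chat; the real list was longer ("...").
BLOCKED_SUBSTRINGS = (
    "yahoo", "google", "hosting-", "ovh", "neko", "systems", "yaya", "natro",
)


def drop_cpanel_domains(hosts):
    """Remove cpanel.dom.tld entries and the matching dom.tld entries."""
    hosts = set(hosts)
    # Domains that expose a cpanel subdomain are assumed not to be free hosting.
    paid = {h[len("cpanel."):] for h in hosts if h.startswith("cpanel.")}
    return sorted(
        h for h in hosts if not h.startswith("cpanel.") and h not in paid
    )


def drop_foreign_redirects(records):
    """Keep hosts whose DNS target does not contain a blocked substring.

    `records` maps hostname -> DNS target (for example a CNAME), or None
    when there is no redirection; this input shape is an assumption.
    """
    return {
        host: target
        for host, target in records.items()
        if not (target and any(s in target.lower() for s in BLOCKED_SUBSTRINGS))
    }
```

Keeping the unfiltered DNS collection alongside the filtered output, as mentioned above, means the substring list can be revised later without re-collecting anything.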
20:43
🔗
|
twoTBHetz |
i am at 67258 |
20:48
🔗
|
twoTBHetz |
meaning 5 more hours? we will see |
20:48
🔗
|
betamax |
great - it's a pain we don't have an exact deadline as to when they're closing |
20:49
🔗
|
twoTBHetz |
you already got a list from me :) |
20:49
🔗
|
twoTBHetz |
this with a second dataset |
20:53
🔗
|
twoTBHetz |
next time i'll use more than 100 parallel connections |
21:23
🔗
|
|
schbirid has quit IRC (Remote host closed the connection) |
21:24
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
21:27
🔗
|
|
Sk1d has joined #archiveteam-bs |
21:32
🔗
|
SmileyG |
can someone throw https://forum.gamestm.co.uk/index.php into archivebot |
21:32
🔗
|
SmileyG |
or tell me how, so it works, considering it's a phpbb or similar forum. |
21:39
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
21:42
🔗
|
|
BlueMax has joined #archiveteam-bs |
21:43
🔗
|
|
Sk1d has joined #archiveteam-bs |
22:45
🔗
|
twoTBHetz |
something weird has happened, progress stopped dead in its tracks |
22:49
🔗
|
twoTBHetz |
no new site in 7 minutes but i can still curl them from the same machine |
22:49
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
22:52
🔗
|
|
Sk1d has joined #archiveteam-bs |
23:04
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
23:09
🔗
|
|
Sk1d has joined #archiveteam-bs |
23:10
🔗
|
twoTBHetz |
that's so weird, i can access all entries up to a certain alphabetical entry, and after that i can not reach them from the server |
23:12
🔗
|
|
Mateon1 has quit IRC (Remote host closed the connection) |
23:12
🔗
|
|
Mateon1 has joined #archiveteam-bs |
23:21
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
23:21
🔗
|
* |
twoTBHetz is highly confused and slightly panicked |
23:25
🔗
|
|
Sk1d has joined #archiveteam-bs |
23:39
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
23:42
🔗
|
|
Sk1d has joined #archiveteam-bs |
23:54
🔗
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
23:54
🔗
|
|
atomicthu has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.) |
23:54
🔗
|
|
atomicthu has joined #archiveteam-bs |
23:58
🔗
|
|
Sk1d has joined #archiveteam-bs |